18
Research Article Algorithm and Architecture Optimization for 2D Discrete Fourier Transforms with Simultaneous Edge Artifact Removal Faisal Mahmood , 1 Märt Toots, 2 Lars-Göran Öfverstedt, 2 and Ulf Skoglund 2 1 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA 2 Structural Cellular Biology Unit, Okinawa Institute of Science and Technology (OIST), Okinawa, Japan Correspondence should be addressed to Faisal Mahmood; [email protected] Received 18 December 2017; Revised 11 May 2018; Accepted 10 June 2018; Published 6 August 2018 Academic Editor: Jo˜ ao Cardoso Copyright © 2018 Faisal Mahmood et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Two-dimensional discrete Fourier transform (DFT) is an extensively used and computationally intensive algorithm, with a plethora of applications. 2D images are, in general, nonperiodic but are assumed to be periodic while calculating their DFTs. is leads to cross-shaped artifacts in the frequency domain due to spectral leakage. ese artifacts can have critical consequences if the DFTs are being used for further processing, specifically for biomedical applications. In this paper we present a novel FPGA-based solution to calculate 2D DFTs with simultaneous edge artifact removal for high-performance applications. Standard approaches for removing these artifacts, using apodization functions or mirroring, either involve removing critical frequencies or necessitate a surge in computation by significantly increasing the image size. We use a periodic plus smooth decomposition-based approach that was optimized to reduce DRAM access and to decrease 1D FFT invocations. 2D FFTs on FPGAs also suffer from the so-called “intermediate storage” or “memory wall” problem, which is due to limited on-chip memory, increasingly large image sizes, and strided column-wise external memory access. We propose a “tile-hopping” memory mapping scheme that significantly improves the bandwidth of the external memory for column-wise reads and can reduce the energy consumption up to 53%. We tested our proposed optimizations on a PXIe-based Xilinx Kintex 7 FPGA system communicating with a host PC, which gives us the advantage of further expanding the design for biomedical applications such as electron microscopy and tomography. We demonstrate that our proposed optimizations can lead to 2.8× reduced FPGA and DRAM energy consumption when calculating high-throughput 4096 × 4096 2D FFTs with simultaneous edge artifact removal. We also used our high-performance 2D FFT implementation to accelerate filtered back-projection for reconstructing tomographic data. 1. Introduction Discrete Fourier Transform (DFT) is a commonly used and vitally important function for a vast variety of applications including, but not limited to, digital communication systems, image processing, computer vision, biomedical imaging, and biometrics [1, 2]. Fourier image analysis simplifies computa- tions by converting complex convolution operations in the spatial domain to simple multiplications in the frequency domain. Due to the fundamental nature of 2D DFTs, they are commonly used in a variety of image processing appli- cations such as tomographic image reconstruction [3], non- linear interpolation, texture analysis, tracking, image quality assessment, and document analysis [4]. Because of their computational complexity, DFTs oſten become a computa- tional constraint for applications requiring high-throughput and real-time or near real-time operations, specifically for machine vision applications [5]. Image sizes for many of these applications have also increased over the years, further contributing to the problem. e Cooley-Tukey fast Fourier transform (FFT) algo- rithm [6], first proposed in 1965, reduces the complexity of DFTs from O( 2 ) to O( log ) for a 1D DFT. However, in the case of 2D DFTs, 1D FFTs have to be computed in two dimensions, increasing the complexity to O( 2 log ), thereby making 2D DFTs a significant bottleneck for real-time machine vision applications [7]. Recently, there has been sub- stantial effort to achieve high-performance implementations of multidimensional FFTs to overcome this constraint [5, 7– 14]. Due to their inherent parallelism and reconfigurability, Field Programmable Gate Arrays (FPGAs) are attractive targets for accelerating FFT computations. Being a highly Hindawi International Journal of Reconfigurable Computing Volume 2018, Article ID 1403181, 17 pages https://doi.org/10.1155/2018/1403181

Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

Research ArticleAlgorithm and Architecture Optimization for 2D DiscreteFourier Transforms with Simultaneous Edge Artifact Removal

Faisal Mahmood 1 Maumlrt Toots2 Lars-Goumlran Oumlfverstedt2 and Ulf Skoglund2

1Department of Biomedical Engineering Johns Hopkins University Baltimore MD 21218 USA2Structural Cellular Biology Unit Okinawa Institute of Science and Technology (OIST) Okinawa Japan

Correspondence should be addressed to Faisal Mahmood faisalmjhuedu

Received 18 December 2017 Revised 11 May 2018 Accepted 10 June 2018 Published 6 August 2018

Academic Editor Joao Cardoso

Copyright copy 2018 FaisalMahmood et alThis is an open access article distributed under theCreativeCommonsAttribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Two-dimensional discrete Fourier transform (DFT) is an extensively used and computationally intensive algorithm with a plethoraof applications 2D images are in general nonperiodic but are assumed to be periodic while calculating their DFTs This leadsto cross-shaped artifacts in the frequency domain due to spectral leakage These artifacts can have critical consequences if theDFTs are being used for further processing specifically for biomedical applications In this paper we present a novel FPGA-basedsolution to calculate 2D DFTs with simultaneous edge artifact removal for high-performance applications Standard approachesfor removing these artifacts using apodization functions or mirroring either involve removing critical frequencies or necessitatea surge in computation by significantly increasing the image size We use a periodic plus smooth decomposition-based approachthat was optimized to reduce DRAM access and to decrease 1D FFT invocations 2D FFTs on FPGAs also suffer from the so-calledldquointermediate storagerdquo or ldquomemory wallrdquo problem which is due to limited on-chip memory increasingly large image sizes andstrided column-wise external memory access We propose a ldquotile-hoppingrdquo memory mapping scheme that significantly improvesthe bandwidth of the external memory for column-wise reads and can reduce the energy consumption up to 53 We tested ourproposed optimizations on a PXIe-basedXilinxKintex 7 FPGA system communicatingwith a host PC which gives us the advantageof further expanding the design for biomedical applications such as electron microscopy and tomography We demonstrate thatour proposed optimizations can lead to 28times reduced FPGA and DRAM energy consumption when calculating high-throughput4096 times 4096 2D FFTs with simultaneous edge artifact removal We also used our high-performance 2D FFT implementation toaccelerate filtered back-projection for reconstructing tomographic data

1 Introduction

Discrete Fourier Transform (DFT) is a commonly used andvitally important function for a vast variety of applicationsincluding but not limited to digital communication systemsimage processing computer vision biomedical imaging andbiometrics [1 2] Fourier image analysis simplifies computa-tions by converting complex convolution operations in thespatial domain to simple multiplications in the frequencydomain Due to the fundamental nature of 2D DFTs theyare commonly used in a variety of image processing appli-cations such as tomographic image reconstruction [3] non-linear interpolation texture analysis tracking image qualityassessment and document analysis [4] Because of theircomputational complexity DFTs often become a computa-tional constraint for applications requiring high-throughput

and real-time or near real-time operations specifically formachine vision applications [5] Image sizes for many ofthese applications have also increased over the years furthercontributing to the problem

The Cooley-Tukey fast Fourier transform (FFT) algo-rithm [6] first proposed in 1965 reduces the complexity ofDFTs from O(1198992) to O(119899 log 119899) for a 1D DFT However inthe case of 2D DFTs 1D FFTs have to be computed in twodimensions increasing the complexity toO(1198992 log 119899) therebymaking 2D DFTs a significant bottleneck for real-timemachine vision applications [7] Recently there has been sub-stantial effort to achieve high-performance implementationsof multidimensional FFTs to overcome this constraint [5 7ndash14] Due to their inherent parallelism and reconfigurabilityField Programmable Gate Arrays (FPGAs) are attractivetargets for accelerating FFT computations Being a highly

HindawiInternational Journal of Reconfigurable ComputingVolume 2018 Article ID 1403181 17 pageshttpsdoiorg10115520181403181

2 International Journal of Reconfigurable Computing

flexible platform FPGAs can fully exploit the parallel natureof the FFT algorithm 2D FFTs are generally calculated instages where all elements of the first stage must be availablebefore the second stage can be calculatedThis creates the so-called ldquointermediate storagerdquo problem associated with stridedexternal memory access specifically for large datasets

While calculating 2D DFTs it is assumed that the imageis periodic which is usually not the case The nonperiodicnature of the image leads to artifacts in the Fourier transformusually known as edge artifacts or series termination errorsThese artifacts appear as several crosses of high-amplitudecoefficients in the frequency domain as seen in [15 16]Such edge artifacts can be passed to subsequent stages ofprocessing and in biomedical applications they may leadto critical misinterpretations of results Efficiently removingsuch artifacts without compromising resolution is a majorproblem Moreover simultaneously removing these spuriousartifacts while calculating the 2D FFT adds to the existingcomplexity of the 2D FFT kernel

Contributions In this paper we present solutions for ahigh-performance 2D DFT with simultaneous edge artifactremoval (EAR) for applications which require high framerate 2D FFTs such as real-time medical imaging systems andmachine vision for control Our proposed optimizations leadto a high-performance solution for removing edge artifactswhile the transform is being calculated thus preventingtime consuming andpossibly erroneous postprocessing stepsMoreover the proposed optimizations reduce the overallenergy consumption This work builds on our previous workpresented in [8] Major contributions include the following

(1) We propose optimized periodic plus smooth decom-position (OPSD) as an optimization for standardperiodic plus smooth decomposition (PSD) for edgeartifact removal (Section 4)

(2) Based on OPSD we propose an architecture thatcan reduce the access to DRAM and can decreasethe number of 1D FFT invocations by performingcolumn-by-column operations on the fly (Section 4)

(3) Since OPSD is heavily dependent on efficient FPGA-based 2D FFT implementation which is limitedby DRAM access problems we design a memorymapping scheme based on ldquotile-hoppingrdquo whichcan reduce row activation overhead while accessingcolumns of data from the DRAM (Sections 5 and 6)

(4) The proposed OPSD and memory ldquotile-hoppingrdquooptimizations also lead to better energy performanceas compared to row-major access (Section 64)

(5) We use our implementation as an accelerator forfiltered back-projection (FBP) an analytical tomo-graphic reconstruction method and show that forlarge datasets our 2D FFT with edge artifact removal(EAR) can significantly improve reconstruction runtime (Section 7)

As compared to our previous work [8] the currentimplementation achieves better runtime ie 15ms as com-pared to 324ms for a 512 times 512 image This increased

performance is achieved by using state-of-the-art hardwarefor high-bandwidth communication between the FPGA andCPU modules and by utilizing an efficient memory map-ping scheme The current work also analyzes the energyconsumption of the proposed paradigm Moreover we alsoshow how the real-time 2D FFTs can be used for tomographicreconstructions

Paper Outline The paper follows FPGA image processingdesign methodology outlined in [16 17] which involvescarefully profiling the software solution to understand com-putational bottlenecks and overcoming them through carefulreformulation of the algorithm within a parallel hardwareframework Section 2 gives a comprehensive backgroundof high-performance 2D FFTs using FPGAs the DRAMintermediate storage problem and edge artifacts Section 3presents PSD in detail which is the implementation objectiveSection 4 presents OPSD an optimized solution to reduce thelatency of the serial part of the algorithmwhich limits overallperformance Section 5 presents a memory mapping schemethat can reduce the column-wise strided external memoryaccess Section 6 presents experimental results and explainstarget selection in detail experimental setup and bench-marks results Section 7 presents filtered back-projection asa proposed application Section 8 presents conclusions

2 Background

21 High-Performance 2D FFTs Using FPGAs There areseveral resource-efficient high-throughput implementationapproaches of multidimensional DFTs on a variety of differ-ent platformsMany of thesemethods are software-based andhave been optimized for efficient performance on general-purpose processors (GPPs) for example Intel MKL [11]FFTW [9] and Spiral [10] Implementations on GPPs canbe readily adapted for a variety of scenarios However GPPsconsume more power as compared to dedicated hardwareand are not ideal for real-time embedded applicationsSeveral application-specific integrated circuit- (ASIC-) basedapproaches have also been proposed [18ndash20] but since it isnot easy tomodifyASICs they are not cost-effective solutionsfor rapid prototyping of image processing systems GraphicalProcessing Units (GPUs) on the other hand can achieverelatively high throughput but are energy inefficient and limitthe portability of large-scale imaging systems

Due to their inherent parallelism and reconfigurabilityFPGAs are attractive for accelerating FFT computationssince they fully exploit the parallel nature of the FFT algo-rithm FPGAs are particularly an attractive target for medicaland biomedical imaging apparatus and instruments suchas electron microscopes and tomographic scanners Suchdevices do not have to be manufactured in bulk to justifyapplication-specific solutions and require high bandwidthMoreover increasing mobility and portability constitute afuture objective for many medical imaging systems FPGAsare also more efficient for prototyping machine vision appli-cations since they are relatively more fine-grained whencompared to GPPs and GPUs and can serve as a bridge

International Journal of Reconfigurable Computing 3

Step 1

Inte

rmed

iate

Sto

rage

Stor

age

Optionalfor streaming access

Row-by-Row 1D FFTs Column-by-Column 1D FFTs

1D F

FTs

1D FFTs

Step 2

n

nn

n

(a) Dataset level view

Accessing Each Row toreadwrite just one

element (Worst Case

Jumping between n DRAMRows to access a single column

(Worst Case Scenario)

Scenario)

Fast Streaming RowAccess An entire row

to data is read into theRow Buffer and

utilized

Step 1 Row Access from DRAM (Fast - Streaming) Step 2 Column Access from DRAM (Slow - Strided)

DRAM Row BufferDRAM Row Buffer

(b) Memory level view

Figure 1 (a) An overview of row-column decomposition (RCD) for 2D FFT implementation Intermediate storage is required because allelements of the row-by-row operations must be available for column-by-column processing (b) An overview of strided column-wise accessfromDRAMas compared to trivial row-wise access An entire row of elementsmust be read into the row buffer even to access a single elementwithin a specific row

between general-purpose and application-specific accelera-tion solutions

22 DRAM Intermediate Storage Problem There have beenseveral high-throughput 2D FFT FPGA-based implementa-tions over the past few years Most of these rely on repeatedinvocations of 1D FFTs by row and column decomposition(RCD) with efficient use of memory [5 7 12 21 22] RCDmakes use of the fact that a 2D Fourier transform is separableand can be implemented in stages ie a row-by-row 1DFFT can be proceeded by a column-by-column 1D FFT withintermediate storage (Figure 1) Most of the previous RCD-based 2D FFT FPGA implementation approaches have twomajor design challenges (1) The 1D FFT implementationneeds to have a reasonably high throughput and needs tobe resource efficient Moreover spatial parallelism needs tobe exploited by running several 1D FFTs simultaneously (2)External DRAMneeds to be efficiently addressed and to havea high bandwidth

Since the column-by-column 1D FFT requires data fromall rows intermediate storage becomes a major problem forlarge datasets Many implementations rely on local memorysuch as resource-implemented block RAM for intermediatestorage which is not possible for large datasets [21] Large

datasets have to be offloaded to external DRAM becauseonly a portion of the dataset that fits on the chip can beoperated on at a given time For complex image processingapplications this means repeated storage and access to theexternal memory during every stage of processing

As shown in Figure 2(a) DRAM hierarchy from top tobottom is rank chip bank row and column Each DRAMbank (Figure 2(b)) has a row buffer that holds the mostrecently referred row There is only one row buffer per bankwhichmeans only one row from the data-grid can be accessedat once Before accessing a row it has to be activated bytransferring the contents from internal capacitor storage intoa set of parallel sense amplifiersThe row buffer is the so-calledldquofast bufferrdquo becausewhen a row is activated and placed in thebuffer any element can be accessed at random

If a new row has to be activated and accessed into therow buffer a row buffer miss occurs and requires a higherlatency 119860푚푖푠푠 (Figure 2(c)) On the contrary if the desiredrow is already in the buffer a row buffer hit or page hit occursand the latency to access elements is substantially lower119860ℎ푖푡This implies 119860푚푖푠푠 = 119860ℎ푖푡 + 119862푟 where 119862푟 is the overheadassociated with accessing a new row to read a specificelement (Figure 2(c)) [12] There is also overhead involved inwriting the row back to the data-grid (precharge) say 119862푤

4 International Journal of Reconfigurable Computing

DRAM Module

Bank 1 Bank 2

Bank 3 Bank s

Bank 1 Bank 2

Bank 3 Bank s

Bank 1 Bank 2

Bank 3 Bank s

Rank 1Rank 0

Chip 0

Chip 1

Chip p

IO

Bus

(a) DRAM hierarchy

DRAM BANK

Column AddressDecoder

Data Array

IO

Row

Add

ress

Row Buffer

Dec

oder

Address Column Address

Row

Addr

ess

(b) Single DRAM bank

DecodeRow

RowPrecharge

ActivateRow

DecodeColumn

Row Buffer Miss

ExtraLatency

Row Buffer Hit

(c) Row buffer hit and miss

Figure 2 (a) An overview of the DRAM hierarchy (b) Image showing the structure of a single DRAM bank (c) Flow chart explainingadditional latency introduced when a new row has to be referred to in the row buffer to access a specific element

However both 119862푟 and 119862푤 can be concealed by interleaving(switching between banks) Since row-wise access is trivialthe row-by-row 1D FFT part of RCD-based 2D FFT is easilyaccomplished However once the row-by-row 1D FFT isstored in the DRAM in standard row-major order to accessa single column each row of the DRAM has to be accessedinto the row buffer rendering the read process extremelyinefficient This is typically the major bottleneck for high-throughput 2D FFTs (Figure 1(b)) We address this problemby designing a custommemory mapping scheme (Section 5)

23 Edge Artifacts While calculating 2D DFTs it is assumedthat the image is periodic which is usually not the caseThe nonperiodic nature of the image leads to artifacts in theFourier transform usually known as edge artifacts or seriestermination errors These artifacts appear as several crossesof high-amplitude coefficients in the frequency domain(Figure 3(b)) Such edge artifacts can be passed to subsequentstages of processing and in biomedical applications theymaylead to critical misinterpretations of results No current 2DFFT FPGA implementation addresses this problem directlyThese artifacts may be removed during preprocessing usingmirroring windowing zero padding or postprocessing eg

filtering techniques These techniques are usually computa-tionally intensive involve an increase in image size and alsotend to modify the transform

The most common approach is by ramping the imageat corner pixels to slowly attenuate the edges Ramping isusually accomplished by an apodization function such asa Tukey (tapered cosine) or a Hamming window whichsmoothly reduces the intensity to zero Such an approach canbe implemented on an FPGA as a preprocessing operationby storing the window function in a look-up table (LUT)and multiplying it with the image stream before calculatingthe FFT [16] Although this approach is not extremelycomputationally intensive for small images it inadvertentlyremoves necessary information from the image Loss of thisinformation may have serious consequences if the imageis being further processed with several other images toreconstruct a final image that is used for diagnostics or otherdecision-critical applications Another common method isby mirroring the image from 119873 times 119873 to 2119873 times 2119873 Doingso makes the image periodic and reduces edge artifactsHowever this not only increases the size of the image by 4timesbut also makes the transform symmetric which generates aninaccurate phase component

International Journal of Reconfigurable Computing 5

(a) (b)

(c) (d)

(e) (f)

Figure 3 (1a) An image with nonperiodic boundary (1b) 2D DFTof (1a) (1c) DFT of the smooth component ie the removedartifacts from (1a) (1d) Periodic component ie DFT of (1a) withedge artifacts removed (1e) Reconstructed smooth component (1f)Reconstructed periodic component

Simultaneously removing the edge artifacts while cal-culating a 2D FFT imposes an additional design challengeregardless of the method used However these artifacts mustbe removed in applications where they may be propagated tosubsequent processing levels An ideal method for removingthese artifacts should involve making the image periodicwhile removing minimal information from the image Peri-odic plus smooth decomposition (PSD) first presented byMoisan [4] and used in [23ndash25] is an ideal method forremoving edge artifacts (specifically for biomedical appli-cations) because it does not directly intervene with pixelsbeside those of the boundary and does not increase imagesizeMoreover its inherently parallel naturemakes it ideal fora high-throughput FPGA-based implementation We havefurther optimized the original PSD decomposition algorithmto make the overall implementation much more efficient by

decreasing the number of required 1D FFT invocations andby reducing external DRAM utilization (Section 4)

24 LabVIEW FPGA High-Level Design Environment Amajor concern while designing complex image processinghardware accelerators is how to fully harness on the divide-and-conquer approach Algorithms that have to be mappedto multiple FPGAs are often marred by communicationproblems and custom FPGA boards reduce flexibility forlarge-scale and evolving designs For rapid prototyping ofour algorithms we used LabVIEW FPGA 2016 (NationalInstruments) a robust data-flow-based graphical designenvironment LabVIEW FPGA provides integration withNational Instruments (NI) Xilinx-based reconfigurable hard-ware allowing efficient communication with a host PC andhigh-throughput communication between multiple FPGAsthrough a PXIe (PCI eXtensions for Industry Express) busLabVIEW FPGA also enables us to integrate external Hard-ware Description Language (HDL) code and gives us theflexibility to expand our design for future processing stagesWe used NI PXIe-7976R FPGA boards that have a XilinxKintex 7 FPGA and 2GB high-bandwidth (10GBs) externalmemory This platform has already been extensively used forrapid prototyping of communication standards and protocolsbefore moving to ASIC designs The optimizations anddesigns we present here are scalable to most reconfigurablecomputing-based systems Moreover LabVIEW FPGA pro-vides efficient high-level control over memory via a smartmemory controller

3 Periodic Plus Smooth Decomposition (PSD)for Edge Artifact Removal (EAR)

Periodic plus smooth decomposition (PSD) involves decom-posing the image into a periodic and smooth componentto remove edge artifacts with minimal loss of informationfrom the image [4] This section presents an overview ofthe PSD algorithm and profiles the algorithm for possibleparallelization and optimization to achieve efficient FPGAimplementation

Let us have discrete 119899 by 119898 gray-scale image 119868 on a finitedomain Ω = 0 1 119899 minus 1 times 0 1 119898 minus 1 The discreteFourier transform (DFT) of 119868 is defined as

(119904 119905) = sum(푖푗)isinΩ

119868 (119894 119895) exp(minus1205802120587 (119904119894119899 + 119905119895119898)) (1)

This is equivalent to a matrix multiplication119882119868119881 where

119882 = (((((((

1 1 1 11 119908 1199082 119908푛minus11 1199082 1199084 1199082(푛minus1) 1 119908푛minus2 1199082(푛minus2) 119908(푛minus2)(푛minus1)1 119908푛minus1 1199082(푛minus1) 119908(푛minus1)(푛minus1)

)))))))

(2)

6 International Journal of Reconfigurable Computing

Step B Step C Step D

Step EStep A

DMAFIFODMA

FIFO

DMAFIFO

IO

Host PCFPGA 1

FPGA 2

SmartImagingDevice Host PC

2D FFT(RCD with DRAM

IntermediateStorage)

2D FFT(RCD with DRAM

IntermediateStorage)

ComputePeriodicBoarder

ComputeSmooth

Component

ComputePeriodic

Component

Figure 4 A top-level architecture for OPSD using two FPGAs and a host PC connected over a high-bandwidth bus The steps are associatedwith Algorithm 1

and

119908푘 = exp(minus1198942120587119899 )푘 = exp(minus1198942120587119896119899 ) (3)

119881 has the same structure as119882 but is m-dimensional 119908푘 hasperiod 119899 which means that 119908푘 = 119908푘+푙푛 forall119896 119897 isin N therefore

119882 = ((((((

1 1 1 1 11 119908 1199082 119908푛minus2 119908푛minus11 1199082 1199084 119908푛minus4 119908푛minus2 1 119908푛minus2 119908푛minus4 1199084 11990821 119908푛minus1 119908푛minus2 1199082 1199081))))))

(4)

Since in general 119868 is not (119899119898)-periodic there will behigh-amplitude edge artifacts present in the DFT stemmingfrom sharp discontinuities between the opposing edges ofthe image as shown in Figure 3(b) Reference [4] proposeda decomposition of 119868 into a periodic component 119875 whichis periodic and captures the essence of the image with allhigh-frequency details and a smoothly varying background119878 which recreates the discontinuities at the borders so 119868 =119875+ 119878 Periodic plus smooth decomposition can be computedby first constructing a border image 119861 = 119877 + 119862 where 119877represents the boundary discontinuities when transitioningrow-wise and 119862when going column-wise

119877 (119894 119895)= 119868 (119899 minus 1 minus 119894 119895) minus 119868 (119894 119895) 119894 = 0 or 119894 = 119899 minus 10 otherwise

119862 (119894 119895)= 119868 (119894 119898 minus 1 minus 119895) minus 119868 (119894 119895) 119895 = 0 or 119895 = 119898 minus 10 otherwise

(5)

It is obvious that the structure of the border image119861 is simplewith nonzero values only in the edges as shown below

119861 = 119877 + 119862= (((

11988711 11988712 1198871푚minus1 1198871푚11988721 0 0 minus11988721 119887푛minus11 0 0 minus119887푛minus11119887푛1 minus11988712 minus1198871푚minus1 minus119887푛푚)))

(6)

The DFT of the smooth component 119878 can be then found bythe following formula

(119904 119905) = (119904 119905)2 cos (2120587119904119899) + 2 cos (2120587119905119898) minus 4 forall (119904 119905) isin Ω (0 0) (7)

The DFT of the image 119868 with edge artifacts removed isthen = minus Figures 3(c) and 3(d) show the DFT ofthe smooth and periodic components respectively Figures3(e) and 3(f) show the reconstructed periodic and smoothcomponents On reconstruction it is evident that there isnegligible visual difference between the actual image and theperiodic reconstructed image for this example

31 Profiling PSD for FPGA Implementation Algorithm 1summarizes the overall PSD implementation There areseveral ways of arranging the algorithm We have arrangedit so that DFTs of the periodic and smooth componentsare readily available for further processing stages For bestresults both the periodic and smooth components shouldundergo similar processing stages and should be added backtogether before displaying the result However dependingon the application it might be acceptable to discard theperiodic component completely For an 119899 times 119898 image stepsA and C have a complexity of O(119899119898 log(119899119898)) and steps Band D have complexity O(119898 + 119899) and O(119898119899) respectivelyComputationally the performance of PSD is limited by stepsA and C Figure 4 shows a proposed top-level architecture

International Journal of Reconfigurable Computing 7

where step A and steps B C and D are completed on separateFPGAs while step E can be done on the host PC Two highend FPGAs are used instead of one because resources onone FPGA are insufficient to compute a large size 2D FFTas well its edge artifact removal components The overallperformance may be limited by FPGA 2 where most of theserial part of the algorithm lies There are two major factorswhich limit the throughput of such a design

(1) While FPGA 1 and FPGA 2 can run in parallel theresult of step A from FPGA 1 has to be stored on thehost PC while steps B C and D are completed onFPGA 2 before step E can be completed on the hostPC

(2) The DRAM intermediate storage problem explainedin Section 22 and Figures 1 and 2 has to be addressedsince strided access to the DRAM for column-wiseoperations can significantly limit throughput

As for (1) it has been addressed in the next section wherewe make use of the inherent symmetry of the boundaryimage to reduce the time required to compute the 2D FFTof the boundary image As for (2) it has been addressed bydesigning a semi-custommemory mapping controller whichrdquotilesrdquo the DRAM floor and rdquohopsrdquo between several tiles so asto minimize strided memory access

4 Proposed Optimized Periodic Plus SmoothDecomposition (OPSD)

In this section we optimize the original PSD algorithmThisoptimization is to effectively reduce the number of 1D FFTinvocations and the number of times the DRAM is accessed

Equation (6) shows that except the corners the boundaryimage 119861 is symmetrical in the sense that boundary rows andcolumns are algebraic negation of each other In total119861has 119899+119898 minus 1 unique elements with the following relations betweencorners with respect to columns and rows

11988711 = 11990311 + 119888111198871푚 = 1199031푚 minus 11988811119887푛1 = minus11990311 + 119888푛1119887푛푚 = minus1199031푚 minus 119888푛1(8)

997904rArr 119887푛푚 = minus11988711 minus 1198871푚 minus 119887푛1 (9)

In computing the FFTof119861 one normally proceeds by firstrunning 1D FFTs column-by-column and then 1D FFTs row-by-row or vice versa An FFT of a column vector 119907with length119899 is119882119907 where119882 is given in (4)The column-wise FFT of thematrix 119861 is then

=119882119861 (10)

Let us have a closer look at the first column denoted by 119861sdot1The 1D FFT of this vector is

sdot1 =119882119861sdot1

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

119887111198872111988731 119887푛minus11119887푛1

))))))

=((((((((((((((((

푛sum푖=1

119887푖1푛sum푖=1

119887푖1119908푖minus1푛sum푖=1

119887푖11199082(푖minus1) 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1)푛sum푖=1

119887푖1119908(푛minus1)(푖minus1)

))))))))))))))))

(11)

It can be shown that the 1D FFT of the column 119895 isin2 3 119898 minus 1 issdot119895 =119882119861sdot119895

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

1198871푗00 0minus1198871푗

))))))

= 1198871푗((((((

01 minus 119908푛minus11 minus 119908푛minus2 1 minus 11990821 minus 119908

))))))

= 1198871푗^

(12)

8 International Journal of Reconfigurable Computing

Input 119868(119894 119895) of size 119899 times 119898Output (119904 119905) (119904 119905)

Step A Compute the 2D DFT of image 119868(119894 119895)1 119868(119894 119895) F997888rarr (119904 119905)Step B Compute periodic border 119861

2 while 1 lt 119895 lt 119898 do3 while 1 lt 119894 lt 119899 do4 if (119894 = 0 or 119894 = 119899 minus 1) then5 119877(119894 119895) larr997888 119868(119899 minus 1 minus 119894 119895) minus 119868(119894 119895)6 else7 119877(119894 119895) larr997888 08 end if9 if (119895 = 0 or 119895 = 119898 minus 1) then10 119862(119894 119895) larr997888 119868(119894 119898 minus 1 minus 119895) minus 119868(119894 119895)11 else12 119862(119894 119895) larr997888 013 end if14 end while15 end while16 119861larr997888 119877 + 119862Step C Compute the 2D DFT of (119904 119905)

17 119861(119894 119895) F997888rarr (119904 119905)Step D Compute the Smooth Component (119904 119905)

18 (119904 119905) larr997888 2 cos(2120587119904119899) + 2 cos(2120587119905119898) minus 419 (119904 119905) larr997888 (119904 119905) divide (119904 119905)

Step E Compute the Periodic Component (119904 119905)20 (119904 119905) larr997888 (119904 119905) minus (119904 119905)21 return (119904 119905) (119904 119905)

Algorithm 1 Periodic plus smooth decomposition (PSD)

and the 1D FFT of the last column 119861sdot119898 is

sdot119898 =119882119861sdot119898 (13)

=((((((((((((((((

minus 푛sum푖=1

119887푖1minus 푛sum푖=1

119887푖1119908푖minus1 + (11988711 + 1198871푚) (1 minus 119908푛minus1)minus 푛sum푖=1

119887푖11199082(푖minus1) + (11988711 + 1198871푚) (1 minus 1199082(푛minus1)) minus 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus2)(푛minus1))minus 푛sum푖=1

119887푖1119908(푛minus1)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus1)(푛minus1))

))))))))))))))))

(14)

sdot119898 = minussdot1 + (11988711 + 1198871푚) ^ (15)

Therefore the column-wise FFT of the matrix 119861 is

= (119861sdot1 11988712^ 1198871푚minus1^ minussdot1 + (11988711 + 1198871푚) ^) (16)

To compute the column-by-column 1DFFT of thematrix119861 we only have to compute the FFT of the first vectorand then use the appropriately scaled vector ^ to derivethe remainder of the columns The row-by-row FFT has

to be calculated in a row burst normal way Algorithm 2presents a summary of the shortcut for calculating (119904 119905)The steps presented in Algorithm 2 can replace step Cin Algorithm 1 By reducing column-by-column 1D FFTcomputations for the boundary image this method cansignificantly reduce the number of 1D FFT invocationsreduce the overall DRAM access and eliminate problematiccolumn-wise strided DRAM access for an efficient FPGA-based implementation For column-wise operations a single1D FFT of size 119898 is required rather than 119899119898 1D FFTs of size119898 Moreover since one has to simply store one column ofdata it can be stored on the on-chip local memory (BRAMor SRAM) This can be implemented by temporarily storingthe initial vector sdot1 and scaling factors 1198871푗 in the blockRAMregister memory drastically reducing DRAM accessand lowering the number of required 1D FFT invocationsPerformance evaluation for this has been presented in theresults section

Table 1 shows a comparison of mirroring PSD andour proposed OPSD with respect to DRAM access pointsMirroring has been used for comparison purposes because itis an alternative technique that reduces edge artifacts whilemaintaining maximum amplitude information Howeverdue to replication of the imagemost of the phase informationis lost Figure 5 graphically shows that our OPSDmethod sig-nificantly reduces reading fromexternalmemory and reduces

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 2: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

2 International Journal of Reconfigurable Computing

flexible platform FPGAs can fully exploit the parallel natureof the FFT algorithm 2D FFTs are generally calculated instages where all elements of the first stage must be availablebefore the second stage can be calculatedThis creates the so-called ldquointermediate storagerdquo problem associated with stridedexternal memory access specifically for large datasets

While calculating 2D DFTs it is assumed that the imageis periodic which is usually not the case The nonperiodicnature of the image leads to artifacts in the Fourier transformusually known as edge artifacts or series termination errorsThese artifacts appear as several crosses of high-amplitudecoefficients in the frequency domain as seen in [15 16]Such edge artifacts can be passed to subsequent stages ofprocessing and in biomedical applications they may leadto critical misinterpretations of results Efficiently removingsuch artifacts without compromising resolution is a majorproblem Moreover simultaneously removing these spuriousartifacts while calculating the 2D FFT adds to the existingcomplexity of the 2D FFT kernel

Contributions In this paper we present solutions for ahigh-performance 2D DFT with simultaneous edge artifactremoval (EAR) for applications which require high framerate 2D FFTs such as real-time medical imaging systems andmachine vision for control Our proposed optimizations leadto a high-performance solution for removing edge artifactswhile the transform is being calculated thus preventingtime consuming andpossibly erroneous postprocessing stepsMoreover the proposed optimizations reduce the overallenergy consumption This work builds on our previous workpresented in [8] Major contributions include the following

(1) We propose optimized periodic plus smooth decom-position (OPSD) as an optimization for standardperiodic plus smooth decomposition (PSD) for edgeartifact removal (Section 4)

(2) Based on OPSD we propose an architecture thatcan reduce the access to DRAM and can decreasethe number of 1D FFT invocations by performingcolumn-by-column operations on the fly (Section 4)

(3) Since OPSD is heavily dependent on efficient FPGA-based 2D FFT implementation which is limitedby DRAM access problems we design a memorymapping scheme based on ldquotile-hoppingrdquo whichcan reduce row activation overhead while accessingcolumns of data from the DRAM (Sections 5 and 6)

(4) The proposed OPSD and memory ldquotile-hoppingrdquooptimizations also lead to better energy performanceas compared to row-major access (Section 64)

(5) We use our implementation as an accelerator forfiltered back-projection (FBP) an analytical tomo-graphic reconstruction method and show that forlarge datasets our 2D FFT with edge artifact removal(EAR) can significantly improve reconstruction runtime (Section 7)

As compared to our previous work [8] the currentimplementation achieves better runtime ie 15ms as com-pared to 324ms for a 512 times 512 image This increased

performance is achieved by using state-of-the-art hardwarefor high-bandwidth communication between the FPGA andCPU modules and by utilizing an efficient memory map-ping scheme The current work also analyzes the energyconsumption of the proposed paradigm Moreover we alsoshow how the real-time 2D FFTs can be used for tomographicreconstructions

Paper Outline The paper follows FPGA image processingdesign methodology outlined in [16 17] which involvescarefully profiling the software solution to understand com-putational bottlenecks and overcoming them through carefulreformulation of the algorithm within a parallel hardwareframework Section 2 gives a comprehensive backgroundof high-performance 2D FFTs using FPGAs the DRAMintermediate storage problem and edge artifacts Section 3presents PSD in detail which is the implementation objectiveSection 4 presents OPSD an optimized solution to reduce thelatency of the serial part of the algorithmwhich limits overallperformance Section 5 presents a memory mapping schemethat can reduce the column-wise strided external memoryaccess Section 6 presents experimental results and explainstarget selection in detail experimental setup and bench-marks results Section 7 presents filtered back-projection asa proposed application Section 8 presents conclusions

2 Background

21 High-Performance 2D FFTs Using FPGAs There areseveral resource-efficient high-throughput implementationapproaches of multidimensional DFTs on a variety of differ-ent platformsMany of thesemethods are software-based andhave been optimized for efficient performance on general-purpose processors (GPPs) for example Intel MKL [11]FFTW [9] and Spiral [10] Implementations on GPPs canbe readily adapted for a variety of scenarios However GPPsconsume more power as compared to dedicated hardwareand are not ideal for real-time embedded applicationsSeveral application-specific integrated circuit- (ASIC-) basedapproaches have also been proposed [18ndash20] but since it isnot easy tomodifyASICs they are not cost-effective solutionsfor rapid prototyping of image processing systems GraphicalProcessing Units (GPUs) on the other hand can achieverelatively high throughput but are energy inefficient and limitthe portability of large-scale imaging systems

Due to their inherent parallelism and reconfigurabilityFPGAs are attractive for accelerating FFT computationssince they fully exploit the parallel nature of the FFT algo-rithm FPGAs are particularly an attractive target for medicaland biomedical imaging apparatus and instruments suchas electron microscopes and tomographic scanners Suchdevices do not have to be manufactured in bulk to justifyapplication-specific solutions and require high bandwidthMoreover increasing mobility and portability constitute afuture objective for many medical imaging systems FPGAsare also more efficient for prototyping machine vision appli-cations since they are relatively more fine-grained whencompared to GPPs and GPUs and can serve as a bridge

International Journal of Reconfigurable Computing 3

Step 1

Inte

rmed

iate

Sto

rage

Stor

age

Optionalfor streaming access

Row-by-Row 1D FFTs Column-by-Column 1D FFTs

1D F

FTs

1D FFTs

Step 2

n

nn

n

(a) Dataset level view

Accessing Each Row toreadwrite just one

element (Worst Case

Jumping between n DRAMRows to access a single column

(Worst Case Scenario)

Scenario)

Fast Streaming RowAccess An entire row

to data is read into theRow Buffer and

utilized

Step 1 Row Access from DRAM (Fast - Streaming) Step 2 Column Access from DRAM (Slow - Strided)

DRAM Row BufferDRAM Row Buffer

(b) Memory level view

Figure 1 (a) An overview of row-column decomposition (RCD) for 2D FFT implementation Intermediate storage is required because allelements of the row-by-row operations must be available for column-by-column processing (b) An overview of strided column-wise accessfromDRAMas compared to trivial row-wise access An entire row of elementsmust be read into the row buffer even to access a single elementwithin a specific row

between general-purpose and application-specific accelera-tion solutions

22 DRAM Intermediate Storage Problem There have beenseveral high-throughput 2D FFT FPGA-based implementa-tions over the past few years Most of these rely on repeatedinvocations of 1D FFTs by row and column decomposition(RCD) with efficient use of memory [5 7 12 21 22] RCDmakes use of the fact that a 2D Fourier transform is separableand can be implemented in stages ie a row-by-row 1DFFT can be proceeded by a column-by-column 1D FFT withintermediate storage (Figure 1) Most of the previous RCD-based 2D FFT FPGA implementation approaches have twomajor design challenges (1) The 1D FFT implementationneeds to have a reasonably high throughput and needs tobe resource efficient Moreover spatial parallelism needs tobe exploited by running several 1D FFTs simultaneously (2)External DRAMneeds to be efficiently addressed and to havea high bandwidth

Since the column-by-column 1D FFT requires data fromall rows intermediate storage becomes a major problem forlarge datasets Many implementations rely on local memorysuch as resource-implemented block RAM for intermediatestorage which is not possible for large datasets [21] Large

datasets have to be offloaded to external DRAM becauseonly a portion of the dataset that fits on the chip can beoperated on at a given time For complex image processingapplications this means repeated storage and access to theexternal memory during every stage of processing

As shown in Figure 2(a) DRAM hierarchy from top tobottom is rank chip bank row and column Each DRAMbank (Figure 2(b)) has a row buffer that holds the mostrecently referred row There is only one row buffer per bankwhichmeans only one row from the data-grid can be accessedat once Before accessing a row it has to be activated bytransferring the contents from internal capacitor storage intoa set of parallel sense amplifiersThe row buffer is the so-calledldquofast bufferrdquo becausewhen a row is activated and placed in thebuffer any element can be accessed at random

If a new row has to be activated and accessed into therow buffer a row buffer miss occurs and requires a higherlatency 119860푚푖푠푠 (Figure 2(c)) On the contrary if the desiredrow is already in the buffer a row buffer hit or page hit occursand the latency to access elements is substantially lower119860ℎ푖푡This implies 119860푚푖푠푠 = 119860ℎ푖푡 + 119862푟 where 119862푟 is the overheadassociated with accessing a new row to read a specificelement (Figure 2(c)) [12] There is also overhead involved inwriting the row back to the data-grid (precharge) say 119862푤

4 International Journal of Reconfigurable Computing

DRAM Module

Bank 1 Bank 2

Bank 3 Bank s

Bank 1 Bank 2

Bank 3 Bank s

Bank 1 Bank 2

Bank 3 Bank s

Rank 1Rank 0

Chip 0

Chip 1

Chip p

IO

Bus

(a) DRAM hierarchy

DRAM BANK

Column AddressDecoder

Data Array

IO

Row

Add

ress

Row Buffer

Dec

oder

Address Column Address

Row

Addr

ess

(b) Single DRAM bank

DecodeRow

RowPrecharge

ActivateRow

DecodeColumn

Row Buffer Miss

ExtraLatency

Row Buffer Hit

(c) Row buffer hit and miss

Figure 2 (a) An overview of the DRAM hierarchy (b) Image showing the structure of a single DRAM bank (c) Flow chart explainingadditional latency introduced when a new row has to be referred to in the row buffer to access a specific element

However both 119862푟 and 119862푤 can be concealed by interleaving(switching between banks) Since row-wise access is trivialthe row-by-row 1D FFT part of RCD-based 2D FFT is easilyaccomplished However once the row-by-row 1D FFT isstored in the DRAM in standard row-major order to accessa single column each row of the DRAM has to be accessedinto the row buffer rendering the read process extremelyinefficient This is typically the major bottleneck for high-throughput 2D FFTs (Figure 1(b)) We address this problemby designing a custommemory mapping scheme (Section 5)

23 Edge Artifacts While calculating 2D DFTs it is assumedthat the image is periodic which is usually not the caseThe nonperiodic nature of the image leads to artifacts in theFourier transform usually known as edge artifacts or seriestermination errors These artifacts appear as several crossesof high-amplitude coefficients in the frequency domain(Figure 3(b)) Such edge artifacts can be passed to subsequentstages of processing and in biomedical applications theymaylead to critical misinterpretations of results No current 2DFFT FPGA implementation addresses this problem directlyThese artifacts may be removed during preprocessing usingmirroring windowing zero padding or postprocessing eg

filtering techniques These techniques are usually computa-tionally intensive involve an increase in image size and alsotend to modify the transform

The most common approach is by ramping the imageat corner pixels to slowly attenuate the edges Ramping isusually accomplished by an apodization function such asa Tukey (tapered cosine) or a Hamming window whichsmoothly reduces the intensity to zero Such an approach canbe implemented on an FPGA as a preprocessing operationby storing the window function in a look-up table (LUT)and multiplying it with the image stream before calculatingthe FFT [16] Although this approach is not extremelycomputationally intensive for small images it inadvertentlyremoves necessary information from the image Loss of thisinformation may have serious consequences if the imageis being further processed with several other images toreconstruct a final image that is used for diagnostics or otherdecision-critical applications Another common method isby mirroring the image from 119873 times 119873 to 2119873 times 2119873 Doingso makes the image periodic and reduces edge artifactsHowever this not only increases the size of the image by 4timesbut also makes the transform symmetric which generates aninaccurate phase component

International Journal of Reconfigurable Computing 5

(a) (b)

(c) (d)

(e) (f)

Figure 3 (1a) An image with nonperiodic boundary (1b) 2D DFTof (1a) (1c) DFT of the smooth component ie the removedartifacts from (1a) (1d) Periodic component ie DFT of (1a) withedge artifacts removed (1e) Reconstructed smooth component (1f)Reconstructed periodic component

Simultaneously removing the edge artifacts while cal-culating a 2D FFT imposes an additional design challengeregardless of the method used However these artifacts mustbe removed in applications where they may be propagated tosubsequent processing levels An ideal method for removingthese artifacts should involve making the image periodicwhile removing minimal information from the image Peri-odic plus smooth decomposition (PSD) first presented byMoisan [4] and used in [23ndash25] is an ideal method forremoving edge artifacts (specifically for biomedical appli-cations) because it does not directly intervene with pixelsbeside those of the boundary and does not increase imagesizeMoreover its inherently parallel naturemakes it ideal fora high-throughput FPGA-based implementation We havefurther optimized the original PSD decomposition algorithmto make the overall implementation much more efficient by

decreasing the number of required 1D FFT invocations andby reducing external DRAM utilization (Section 4)

24 LabVIEW FPGA High-Level Design Environment Amajor concern while designing complex image processinghardware accelerators is how to fully harness on the divide-and-conquer approach Algorithms that have to be mappedto multiple FPGAs are often marred by communicationproblems and custom FPGA boards reduce flexibility forlarge-scale and evolving designs For rapid prototyping ofour algorithms we used LabVIEW FPGA 2016 (NationalInstruments) a robust data-flow-based graphical designenvironment LabVIEW FPGA provides integration withNational Instruments (NI) Xilinx-based reconfigurable hard-ware allowing efficient communication with a host PC andhigh-throughput communication between multiple FPGAsthrough a PXIe (PCI eXtensions for Industry Express) busLabVIEW FPGA also enables us to integrate external Hard-ware Description Language (HDL) code and gives us theflexibility to expand our design for future processing stagesWe used NI PXIe-7976R FPGA boards that have a XilinxKintex 7 FPGA and 2GB high-bandwidth (10GBs) externalmemory This platform has already been extensively used forrapid prototyping of communication standards and protocolsbefore moving to ASIC designs The optimizations anddesigns we present here are scalable to most reconfigurablecomputing-based systems Moreover LabVIEW FPGA pro-vides efficient high-level control over memory via a smartmemory controller

3 Periodic Plus Smooth Decomposition (PSD)for Edge Artifact Removal (EAR)

Periodic plus smooth decomposition (PSD) involves decom-posing the image into a periodic and smooth componentto remove edge artifacts with minimal loss of informationfrom the image [4] This section presents an overview ofthe PSD algorithm and profiles the algorithm for possibleparallelization and optimization to achieve efficient FPGAimplementation

Let us have discrete 119899 by 119898 gray-scale image 119868 on a finitedomain Ω = 0 1 119899 minus 1 times 0 1 119898 minus 1 The discreteFourier transform (DFT) of 119868 is defined as

(119904 119905) = sum(푖푗)isinΩ

119868 (119894 119895) exp(minus1205802120587 (119904119894119899 + 119905119895119898)) (1)

This is equivalent to a matrix multiplication119882119868119881 where

119882 = (((((((

1 1 1 11 119908 1199082 119908푛minus11 1199082 1199084 1199082(푛minus1) 1 119908푛minus2 1199082(푛minus2) 119908(푛minus2)(푛minus1)1 119908푛minus1 1199082(푛minus1) 119908(푛minus1)(푛minus1)

)))))))

(2)

6 International Journal of Reconfigurable Computing

Step B Step C Step D

Step EStep A

DMAFIFODMA

FIFO

DMAFIFO

IO

Host PCFPGA 1

FPGA 2

SmartImagingDevice Host PC

2D FFT(RCD with DRAM

IntermediateStorage)

2D FFT(RCD with DRAM

IntermediateStorage)

ComputePeriodicBoarder

ComputeSmooth

Component

ComputePeriodic

Component

Figure 4 A top-level architecture for OPSD using two FPGAs and a host PC connected over a high-bandwidth bus The steps are associatedwith Algorithm 1

and

119908푘 = exp(minus1198942120587119899 )푘 = exp(minus1198942120587119896119899 ) (3)

119881 has the same structure as119882 but is m-dimensional 119908푘 hasperiod 119899 which means that 119908푘 = 119908푘+푙푛 forall119896 119897 isin N therefore

119882 = ((((((

1 1 1 1 11 119908 1199082 119908푛minus2 119908푛minus11 1199082 1199084 119908푛minus4 119908푛minus2 1 119908푛minus2 119908푛minus4 1199084 11990821 119908푛minus1 119908푛minus2 1199082 1199081))))))

(4)

Since in general 119868 is not (119899119898)-periodic there will behigh-amplitude edge artifacts present in the DFT stemmingfrom sharp discontinuities between the opposing edges ofthe image as shown in Figure 3(b) Reference [4] proposeda decomposition of 119868 into a periodic component 119875 whichis periodic and captures the essence of the image with allhigh-frequency details and a smoothly varying background119878 which recreates the discontinuities at the borders so 119868 =119875+ 119878 Periodic plus smooth decomposition can be computedby first constructing a border image 119861 = 119877 + 119862 where 119877represents the boundary discontinuities when transitioningrow-wise and 119862when going column-wise

119877 (119894 119895)= 119868 (119899 minus 1 minus 119894 119895) minus 119868 (119894 119895) 119894 = 0 or 119894 = 119899 minus 10 otherwise

119862 (119894 119895)= 119868 (119894 119898 minus 1 minus 119895) minus 119868 (119894 119895) 119895 = 0 or 119895 = 119898 minus 10 otherwise

(5)

It is obvious that the structure of the border image119861 is simplewith nonzero values only in the edges as shown below

119861 = 119877 + 119862= (((

11988711 11988712 1198871푚minus1 1198871푚11988721 0 0 minus11988721 119887푛minus11 0 0 minus119887푛minus11119887푛1 minus11988712 minus1198871푚minus1 minus119887푛푚)))

(6)

The DFT of the smooth component 119878 can be then found bythe following formula

(119904 119905) = (119904 119905)2 cos (2120587119904119899) + 2 cos (2120587119905119898) minus 4 forall (119904 119905) isin Ω (0 0) (7)

The DFT of the image 119868 with edge artifacts removed isthen = minus Figures 3(c) and 3(d) show the DFT ofthe smooth and periodic components respectively Figures3(e) and 3(f) show the reconstructed periodic and smoothcomponents On reconstruction it is evident that there isnegligible visual difference between the actual image and theperiodic reconstructed image for this example

31 Profiling PSD for FPGA Implementation Algorithm 1summarizes the overall PSD implementation There areseveral ways of arranging the algorithm We have arrangedit so that DFTs of the periodic and smooth componentsare readily available for further processing stages For bestresults both the periodic and smooth components shouldundergo similar processing stages and should be added backtogether before displaying the result However dependingon the application it might be acceptable to discard theperiodic component completely For an 119899 times 119898 image stepsA and C have a complexity of O(119899119898 log(119899119898)) and steps Band D have complexity O(119898 + 119899) and O(119898119899) respectivelyComputationally the performance of PSD is limited by stepsA and C Figure 4 shows a proposed top-level architecture

International Journal of Reconfigurable Computing 7

where step A and steps B C and D are completed on separateFPGAs while step E can be done on the host PC Two highend FPGAs are used instead of one because resources onone FPGA are insufficient to compute a large size 2D FFTas well its edge artifact removal components The overallperformance may be limited by FPGA 2 where most of theserial part of the algorithm lies There are two major factorswhich limit the throughput of such a design

(1) While FPGA 1 and FPGA 2 can run in parallel theresult of step A from FPGA 1 has to be stored on thehost PC while steps B C and D are completed onFPGA 2 before step E can be completed on the hostPC

(2) The DRAM intermediate storage problem explainedin Section 22 and Figures 1 and 2 has to be addressedsince strided access to the DRAM for column-wiseoperations can significantly limit throughput

As for (1) it has been addressed in the next section wherewe make use of the inherent symmetry of the boundaryimage to reduce the time required to compute the 2D FFTof the boundary image As for (2) it has been addressed bydesigning a semi-custommemory mapping controller whichrdquotilesrdquo the DRAM floor and rdquohopsrdquo between several tiles so asto minimize strided memory access

4 Proposed Optimized Periodic Plus SmoothDecomposition (OPSD)

In this section we optimize the original PSD algorithmThisoptimization is to effectively reduce the number of 1D FFTinvocations and the number of times the DRAM is accessed

Equation (6) shows that except the corners the boundaryimage 119861 is symmetrical in the sense that boundary rows andcolumns are algebraic negation of each other In total119861has 119899+119898 minus 1 unique elements with the following relations betweencorners with respect to columns and rows

11988711 = 11990311 + 119888111198871푚 = 1199031푚 minus 11988811119887푛1 = minus11990311 + 119888푛1119887푛푚 = minus1199031푚 minus 119888푛1(8)

997904rArr 119887푛푚 = minus11988711 minus 1198871푚 minus 119887푛1 (9)

In computing the FFTof119861 one normally proceeds by firstrunning 1D FFTs column-by-column and then 1D FFTs row-by-row or vice versa An FFT of a column vector 119907with length119899 is119882119907 where119882 is given in (4)The column-wise FFT of thematrix 119861 is then

=119882119861 (10)

Let us have a closer look at the first column denoted by 119861sdot1The 1D FFT of this vector is

sdot1 =119882119861sdot1

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

119887111198872111988731 119887푛minus11119887푛1

))))))

=((((((((((((((((

푛sum푖=1

119887푖1푛sum푖=1

119887푖1119908푖minus1푛sum푖=1

119887푖11199082(푖minus1) 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1)푛sum푖=1

119887푖1119908(푛minus1)(푖minus1)

))))))))))))))))

(11)

It can be shown that the 1D FFT of the column 119895 isin2 3 119898 minus 1 issdot119895 =119882119861sdot119895

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

1198871푗00 0minus1198871푗

))))))

= 1198871푗((((((

01 minus 119908푛minus11 minus 119908푛minus2 1 minus 11990821 minus 119908

))))))

= 1198871푗^

(12)

8 International Journal of Reconfigurable Computing

Input 119868(119894 119895) of size 119899 times 119898Output (119904 119905) (119904 119905)

Step A Compute the 2D DFT of image 119868(119894 119895)1 119868(119894 119895) F997888rarr (119904 119905)Step B Compute periodic border 119861

2 while 1 lt 119895 lt 119898 do3 while 1 lt 119894 lt 119899 do4 if (119894 = 0 or 119894 = 119899 minus 1) then5 119877(119894 119895) larr997888 119868(119899 minus 1 minus 119894 119895) minus 119868(119894 119895)6 else7 119877(119894 119895) larr997888 08 end if9 if (119895 = 0 or 119895 = 119898 minus 1) then10 119862(119894 119895) larr997888 119868(119894 119898 minus 1 minus 119895) minus 119868(119894 119895)11 else12 119862(119894 119895) larr997888 013 end if14 end while15 end while16 119861larr997888 119877 + 119862Step C Compute the 2D DFT of (119904 119905)

17 119861(119894 119895) F997888rarr (119904 119905)Step D Compute the Smooth Component (119904 119905)

18 (119904 119905) larr997888 2 cos(2120587119904119899) + 2 cos(2120587119905119898) minus 419 (119904 119905) larr997888 (119904 119905) divide (119904 119905)

Step E Compute the Periodic Component (119904 119905)20 (119904 119905) larr997888 (119904 119905) minus (119904 119905)21 return (119904 119905) (119904 119905)

Algorithm 1 Periodic plus smooth decomposition (PSD)

and the 1D FFT of the last column 119861sdot119898 is

sdot119898 =119882119861sdot119898 (13)

=((((((((((((((((

minus 푛sum푖=1

119887푖1minus 푛sum푖=1

119887푖1119908푖minus1 + (11988711 + 1198871푚) (1 minus 119908푛minus1)minus 푛sum푖=1

119887푖11199082(푖minus1) + (11988711 + 1198871푚) (1 minus 1199082(푛minus1)) minus 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus2)(푛minus1))minus 푛sum푖=1

119887푖1119908(푛minus1)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus1)(푛minus1))

))))))))))))))))

(14)

sdot119898 = minussdot1 + (11988711 + 1198871푚) ^ (15)

Therefore the column-wise FFT of the matrix 119861 is

= (119861sdot1 11988712^ 1198871푚minus1^ minussdot1 + (11988711 + 1198871푚) ^) (16)

To compute the column-by-column 1DFFT of thematrix119861 we only have to compute the FFT of the first vectorand then use the appropriately scaled vector ^ to derivethe remainder of the columns The row-by-row FFT has

to be calculated in a row burst normal way Algorithm 2presents a summary of the shortcut for calculating (119904 119905)The steps presented in Algorithm 2 can replace step Cin Algorithm 1 By reducing column-by-column 1D FFTcomputations for the boundary image this method cansignificantly reduce the number of 1D FFT invocationsreduce the overall DRAM access and eliminate problematiccolumn-wise strided DRAM access for an efficient FPGA-based implementation For column-wise operations a single1D FFT of size 119898 is required rather than 119899119898 1D FFTs of size119898 Moreover since one has to simply store one column ofdata it can be stored on the on-chip local memory (BRAMor SRAM) This can be implemented by temporarily storingthe initial vector sdot1 and scaling factors 1198871푗 in the blockRAMregister memory drastically reducing DRAM accessand lowering the number of required 1D FFT invocationsPerformance evaluation for this has been presented in theresults section

Table 1 shows a comparison of mirroring PSD andour proposed OPSD with respect to DRAM access pointsMirroring has been used for comparison purposes because itis an alternative technique that reduces edge artifacts whilemaintaining maximum amplitude information Howeverdue to replication of the imagemost of the phase informationis lost Figure 5 graphically shows that our OPSDmethod sig-nificantly reduces reading fromexternalmemory and reduces

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 3: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of Reconfigurable Computing 3

Step 1

Inte

rmed

iate

Sto

rage

Stor

age

Optionalfor streaming access

Row-by-Row 1D FFTs Column-by-Column 1D FFTs

1D F

FTs

1D FFTs

Step 2

n

nn

n

(a) Dataset level view

Accessing Each Row toreadwrite just one

element (Worst Case

Jumping between n DRAMRows to access a single column

(Worst Case Scenario)

Scenario)

Fast Streaming RowAccess An entire row

to data is read into theRow Buffer and

utilized

Step 1 Row Access from DRAM (Fast - Streaming) Step 2 Column Access from DRAM (Slow - Strided)

DRAM Row BufferDRAM Row Buffer

(b) Memory level view

Figure 1 (a) An overview of row-column decomposition (RCD) for 2D FFT implementation Intermediate storage is required because allelements of the row-by-row operations must be available for column-by-column processing (b) An overview of strided column-wise accessfromDRAMas compared to trivial row-wise access An entire row of elementsmust be read into the row buffer even to access a single elementwithin a specific row

between general-purpose and application-specific accelera-tion solutions

22 DRAM Intermediate Storage Problem There have beenseveral high-throughput 2D FFT FPGA-based implementa-tions over the past few years Most of these rely on repeatedinvocations of 1D FFTs by row and column decomposition(RCD) with efficient use of memory [5 7 12 21 22] RCDmakes use of the fact that a 2D Fourier transform is separableand can be implemented in stages ie a row-by-row 1DFFT can be proceeded by a column-by-column 1D FFT withintermediate storage (Figure 1) Most of the previous RCD-based 2D FFT FPGA implementation approaches have twomajor design challenges (1) The 1D FFT implementationneeds to have a reasonably high throughput and needs tobe resource efficient Moreover spatial parallelism needs tobe exploited by running several 1D FFTs simultaneously (2)External DRAMneeds to be efficiently addressed and to havea high bandwidth

Since the column-by-column 1D FFT requires data fromall rows intermediate storage becomes a major problem forlarge datasets Many implementations rely on local memorysuch as resource-implemented block RAM for intermediatestorage which is not possible for large datasets [21] Large

datasets have to be offloaded to external DRAM becauseonly a portion of the dataset that fits on the chip can beoperated on at a given time For complex image processingapplications this means repeated storage and access to theexternal memory during every stage of processing

As shown in Figure 2(a) DRAM hierarchy from top tobottom is rank chip bank row and column Each DRAMbank (Figure 2(b)) has a row buffer that holds the mostrecently referred row There is only one row buffer per bankwhichmeans only one row from the data-grid can be accessedat once Before accessing a row it has to be activated bytransferring the contents from internal capacitor storage intoa set of parallel sense amplifiersThe row buffer is the so-calledldquofast bufferrdquo becausewhen a row is activated and placed in thebuffer any element can be accessed at random

If a new row has to be activated and accessed into therow buffer a row buffer miss occurs and requires a higherlatency 119860푚푖푠푠 (Figure 2(c)) On the contrary if the desiredrow is already in the buffer a row buffer hit or page hit occursand the latency to access elements is substantially lower119860ℎ푖푡This implies 119860푚푖푠푠 = 119860ℎ푖푡 + 119862푟 where 119862푟 is the overheadassociated with accessing a new row to read a specificelement (Figure 2(c)) [12] There is also overhead involved inwriting the row back to the data-grid (precharge) say 119862푤

4 International Journal of Reconfigurable Computing

DRAM Module

Bank 1 Bank 2

Bank 3 Bank s

Bank 1 Bank 2

Bank 3 Bank s

Bank 1 Bank 2

Bank 3 Bank s

Rank 1Rank 0

Chip 0

Chip 1

Chip p

IO

Bus

(a) DRAM hierarchy

DRAM BANK

Column AddressDecoder

Data Array

IO

Row

Add

ress

Row Buffer

Dec

oder

Address Column Address

Row

Addr

ess

(b) Single DRAM bank

DecodeRow

RowPrecharge

ActivateRow

DecodeColumn

Row Buffer Miss

ExtraLatency

Row Buffer Hit

(c) Row buffer hit and miss

Figure 2 (a) An overview of the DRAM hierarchy (b) Image showing the structure of a single DRAM bank (c) Flow chart explainingadditional latency introduced when a new row has to be referred to in the row buffer to access a specific element

However both 119862푟 and 119862푤 can be concealed by interleaving(switching between banks) Since row-wise access is trivialthe row-by-row 1D FFT part of RCD-based 2D FFT is easilyaccomplished However once the row-by-row 1D FFT isstored in the DRAM in standard row-major order to accessa single column each row of the DRAM has to be accessedinto the row buffer rendering the read process extremelyinefficient This is typically the major bottleneck for high-throughput 2D FFTs (Figure 1(b)) We address this problemby designing a custommemory mapping scheme (Section 5)

23 Edge Artifacts While calculating 2D DFTs it is assumedthat the image is periodic which is usually not the caseThe nonperiodic nature of the image leads to artifacts in theFourier transform usually known as edge artifacts or seriestermination errors These artifacts appear as several crossesof high-amplitude coefficients in the frequency domain(Figure 3(b)) Such edge artifacts can be passed to subsequentstages of processing and in biomedical applications theymaylead to critical misinterpretations of results No current 2DFFT FPGA implementation addresses this problem directlyThese artifacts may be removed during preprocessing usingmirroring windowing zero padding or postprocessing eg

filtering techniques These techniques are usually computa-tionally intensive involve an increase in image size and alsotend to modify the transform

The most common approach is by ramping the imageat corner pixels to slowly attenuate the edges Ramping isusually accomplished by an apodization function such asa Tukey (tapered cosine) or a Hamming window whichsmoothly reduces the intensity to zero Such an approach canbe implemented on an FPGA as a preprocessing operationby storing the window function in a look-up table (LUT)and multiplying it with the image stream before calculatingthe FFT [16] Although this approach is not extremelycomputationally intensive for small images it inadvertentlyremoves necessary information from the image Loss of thisinformation may have serious consequences if the imageis being further processed with several other images toreconstruct a final image that is used for diagnostics or otherdecision-critical applications Another common method isby mirroring the image from 119873 times 119873 to 2119873 times 2119873 Doingso makes the image periodic and reduces edge artifactsHowever this not only increases the size of the image by 4timesbut also makes the transform symmetric which generates aninaccurate phase component

International Journal of Reconfigurable Computing 5

(a) (b)

(c) (d)

(e) (f)

Figure 3 (1a) An image with nonperiodic boundary (1b) 2D DFTof (1a) (1c) DFT of the smooth component ie the removedartifacts from (1a) (1d) Periodic component ie DFT of (1a) withedge artifacts removed (1e) Reconstructed smooth component (1f)Reconstructed periodic component

Simultaneously removing the edge artifacts while cal-culating a 2D FFT imposes an additional design challengeregardless of the method used However these artifacts mustbe removed in applications where they may be propagated tosubsequent processing levels An ideal method for removingthese artifacts should involve making the image periodicwhile removing minimal information from the image Peri-odic plus smooth decomposition (PSD) first presented byMoisan [4] and used in [23ndash25] is an ideal method forremoving edge artifacts (specifically for biomedical appli-cations) because it does not directly intervene with pixelsbeside those of the boundary and does not increase imagesizeMoreover its inherently parallel naturemakes it ideal fora high-throughput FPGA-based implementation We havefurther optimized the original PSD decomposition algorithmto make the overall implementation much more efficient by

decreasing the number of required 1D FFT invocations andby reducing external DRAM utilization (Section 4)

24 LabVIEW FPGA High-Level Design Environment Amajor concern while designing complex image processinghardware accelerators is how to fully harness on the divide-and-conquer approach Algorithms that have to be mappedto multiple FPGAs are often marred by communicationproblems and custom FPGA boards reduce flexibility forlarge-scale and evolving designs For rapid prototyping ofour algorithms we used LabVIEW FPGA 2016 (NationalInstruments) a robust data-flow-based graphical designenvironment LabVIEW FPGA provides integration withNational Instruments (NI) Xilinx-based reconfigurable hard-ware allowing efficient communication with a host PC andhigh-throughput communication between multiple FPGAsthrough a PXIe (PCI eXtensions for Industry Express) busLabVIEW FPGA also enables us to integrate external Hard-ware Description Language (HDL) code and gives us theflexibility to expand our design for future processing stagesWe used NI PXIe-7976R FPGA boards that have a XilinxKintex 7 FPGA and 2GB high-bandwidth (10GBs) externalmemory This platform has already been extensively used forrapid prototyping of communication standards and protocolsbefore moving to ASIC designs The optimizations anddesigns we present here are scalable to most reconfigurablecomputing-based systems Moreover LabVIEW FPGA pro-vides efficient high-level control over memory via a smartmemory controller

3 Periodic Plus Smooth Decomposition (PSD)for Edge Artifact Removal (EAR)

Periodic plus smooth decomposition (PSD) involves decom-posing the image into a periodic and smooth componentto remove edge artifacts with minimal loss of informationfrom the image [4] This section presents an overview ofthe PSD algorithm and profiles the algorithm for possibleparallelization and optimization to achieve efficient FPGAimplementation

Let us have discrete 119899 by 119898 gray-scale image 119868 on a finitedomain Ω = 0 1 119899 minus 1 times 0 1 119898 minus 1 The discreteFourier transform (DFT) of 119868 is defined as

(119904 119905) = sum(푖푗)isinΩ

119868 (119894 119895) exp(minus1205802120587 (119904119894119899 + 119905119895119898)) (1)

This is equivalent to a matrix multiplication119882119868119881 where

119882 = (((((((

1 1 1 11 119908 1199082 119908푛minus11 1199082 1199084 1199082(푛minus1) 1 119908푛minus2 1199082(푛minus2) 119908(푛minus2)(푛minus1)1 119908푛minus1 1199082(푛minus1) 119908(푛minus1)(푛minus1)

)))))))

(2)

6 International Journal of Reconfigurable Computing

Step B Step C Step D

Step EStep A

DMAFIFODMA

FIFO

DMAFIFO

IO

Host PCFPGA 1

FPGA 2

SmartImagingDevice Host PC

2D FFT(RCD with DRAM

IntermediateStorage)

2D FFT(RCD with DRAM

IntermediateStorage)

ComputePeriodicBoarder

ComputeSmooth

Component

ComputePeriodic

Component

Figure 4 A top-level architecture for OPSD using two FPGAs and a host PC connected over a high-bandwidth bus The steps are associatedwith Algorithm 1

and

119908푘 = exp(minus1198942120587119899 )푘 = exp(minus1198942120587119896119899 ) (3)

119881 has the same structure as119882 but is m-dimensional 119908푘 hasperiod 119899 which means that 119908푘 = 119908푘+푙푛 forall119896 119897 isin N therefore

119882 = ((((((

1 1 1 1 11 119908 1199082 119908푛minus2 119908푛minus11 1199082 1199084 119908푛minus4 119908푛minus2 1 119908푛minus2 119908푛minus4 1199084 11990821 119908푛minus1 119908푛minus2 1199082 1199081))))))

(4)

Since in general 119868 is not (119899119898)-periodic there will behigh-amplitude edge artifacts present in the DFT stemmingfrom sharp discontinuities between the opposing edges ofthe image as shown in Figure 3(b) Reference [4] proposeda decomposition of 119868 into a periodic component 119875 whichis periodic and captures the essence of the image with allhigh-frequency details and a smoothly varying background119878 which recreates the discontinuities at the borders so 119868 =119875+ 119878 Periodic plus smooth decomposition can be computedby first constructing a border image 119861 = 119877 + 119862 where 119877represents the boundary discontinuities when transitioningrow-wise and 119862when going column-wise

119877 (119894 119895)= 119868 (119899 minus 1 minus 119894 119895) minus 119868 (119894 119895) 119894 = 0 or 119894 = 119899 minus 10 otherwise

119862 (119894 119895)= 119868 (119894 119898 minus 1 minus 119895) minus 119868 (119894 119895) 119895 = 0 or 119895 = 119898 minus 10 otherwise

(5)

It is obvious that the structure of the border image119861 is simplewith nonzero values only in the edges as shown below

119861 = 119877 + 119862= (((

11988711 11988712 1198871푚minus1 1198871푚11988721 0 0 minus11988721 119887푛minus11 0 0 minus119887푛minus11119887푛1 minus11988712 minus1198871푚minus1 minus119887푛푚)))

(6)

The DFT of the smooth component 119878 can be then found bythe following formula

(119904 119905) = (119904 119905)2 cos (2120587119904119899) + 2 cos (2120587119905119898) minus 4 forall (119904 119905) isin Ω (0 0) (7)

The DFT of the image 119868 with edge artifacts removed isthen = minus Figures 3(c) and 3(d) show the DFT ofthe smooth and periodic components respectively Figures3(e) and 3(f) show the reconstructed periodic and smoothcomponents On reconstruction it is evident that there isnegligible visual difference between the actual image and theperiodic reconstructed image for this example

31 Profiling PSD for FPGA Implementation Algorithm 1summarizes the overall PSD implementation There areseveral ways of arranging the algorithm We have arrangedit so that DFTs of the periodic and smooth componentsare readily available for further processing stages For bestresults both the periodic and smooth components shouldundergo similar processing stages and should be added backtogether before displaying the result However dependingon the application it might be acceptable to discard theperiodic component completely For an 119899 times 119898 image stepsA and C have a complexity of O(119899119898 log(119899119898)) and steps Band D have complexity O(119898 + 119899) and O(119898119899) respectivelyComputationally the performance of PSD is limited by stepsA and C Figure 4 shows a proposed top-level architecture

International Journal of Reconfigurable Computing 7

where step A and steps B C and D are completed on separateFPGAs while step E can be done on the host PC Two highend FPGAs are used instead of one because resources onone FPGA are insufficient to compute a large size 2D FFTas well its edge artifact removal components The overallperformance may be limited by FPGA 2 where most of theserial part of the algorithm lies There are two major factorswhich limit the throughput of such a design

(1) While FPGA 1 and FPGA 2 can run in parallel theresult of step A from FPGA 1 has to be stored on thehost PC while steps B C and D are completed onFPGA 2 before step E can be completed on the hostPC

(2) The DRAM intermediate storage problem explainedin Section 22 and Figures 1 and 2 has to be addressedsince strided access to the DRAM for column-wiseoperations can significantly limit throughput

As for (1) it has been addressed in the next section wherewe make use of the inherent symmetry of the boundaryimage to reduce the time required to compute the 2D FFTof the boundary image As for (2) it has been addressed bydesigning a semi-custommemory mapping controller whichrdquotilesrdquo the DRAM floor and rdquohopsrdquo between several tiles so asto minimize strided memory access

4 Proposed Optimized Periodic Plus SmoothDecomposition (OPSD)

In this section we optimize the original PSD algorithmThisoptimization is to effectively reduce the number of 1D FFTinvocations and the number of times the DRAM is accessed

Equation (6) shows that except the corners the boundaryimage 119861 is symmetrical in the sense that boundary rows andcolumns are algebraic negation of each other In total119861has 119899+119898 minus 1 unique elements with the following relations betweencorners with respect to columns and rows

11988711 = 11990311 + 119888111198871푚 = 1199031푚 minus 11988811119887푛1 = minus11990311 + 119888푛1119887푛푚 = minus1199031푚 minus 119888푛1(8)

997904rArr 119887푛푚 = minus11988711 minus 1198871푚 minus 119887푛1 (9)

In computing the FFTof119861 one normally proceeds by firstrunning 1D FFTs column-by-column and then 1D FFTs row-by-row or vice versa An FFT of a column vector 119907with length119899 is119882119907 where119882 is given in (4)The column-wise FFT of thematrix 119861 is then

=119882119861 (10)

Let us have a closer look at the first column denoted by 119861sdot1The 1D FFT of this vector is

sdot1 =119882119861sdot1

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

119887111198872111988731 119887푛minus11119887푛1

))))))

=((((((((((((((((

푛sum푖=1

119887푖1푛sum푖=1

119887푖1119908푖minus1푛sum푖=1

119887푖11199082(푖minus1) 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1)푛sum푖=1

119887푖1119908(푛minus1)(푖minus1)

))))))))))))))))

(11)

It can be shown that the 1D FFT of the column 119895 isin2 3 119898 minus 1 issdot119895 =119882119861sdot119895

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

1198871푗00 0minus1198871푗

))))))

= 1198871푗((((((

01 minus 119908푛minus11 minus 119908푛minus2 1 minus 11990821 minus 119908

))))))

= 1198871푗^

(12)

8 International Journal of Reconfigurable Computing

Input 119868(119894 119895) of size 119899 times 119898Output (119904 119905) (119904 119905)

Step A Compute the 2D DFT of image 119868(119894 119895)1 119868(119894 119895) F997888rarr (119904 119905)Step B Compute periodic border 119861

2 while 1 lt 119895 lt 119898 do3 while 1 lt 119894 lt 119899 do4 if (119894 = 0 or 119894 = 119899 minus 1) then5 119877(119894 119895) larr997888 119868(119899 minus 1 minus 119894 119895) minus 119868(119894 119895)6 else7 119877(119894 119895) larr997888 08 end if9 if (119895 = 0 or 119895 = 119898 minus 1) then10 119862(119894 119895) larr997888 119868(119894 119898 minus 1 minus 119895) minus 119868(119894 119895)11 else12 119862(119894 119895) larr997888 013 end if14 end while15 end while16 119861larr997888 119877 + 119862Step C Compute the 2D DFT of (119904 119905)

17 119861(119894 119895) F997888rarr (119904 119905)Step D Compute the Smooth Component (119904 119905)

18 (119904 119905) larr997888 2 cos(2120587119904119899) + 2 cos(2120587119905119898) minus 419 (119904 119905) larr997888 (119904 119905) divide (119904 119905)

Step E Compute the Periodic Component (119904 119905)20 (119904 119905) larr997888 (119904 119905) minus (119904 119905)21 return (119904 119905) (119904 119905)

Algorithm 1 Periodic plus smooth decomposition (PSD)

and the 1D FFT of the last column 119861sdot119898 is

sdot119898 =119882119861sdot119898 (13)

=((((((((((((((((

minus 푛sum푖=1

119887푖1minus 푛sum푖=1

119887푖1119908푖minus1 + (11988711 + 1198871푚) (1 minus 119908푛minus1)minus 푛sum푖=1

119887푖11199082(푖minus1) + (11988711 + 1198871푚) (1 minus 1199082(푛minus1)) minus 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus2)(푛minus1))minus 푛sum푖=1

119887푖1119908(푛minus1)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus1)(푛minus1))

))))))))))))))))

(14)

sdot119898 = minussdot1 + (11988711 + 1198871푚) ^ (15)

Therefore the column-wise FFT of the matrix 119861 is

= (119861sdot1 11988712^ 1198871푚minus1^ minussdot1 + (11988711 + 1198871푚) ^) (16)

To compute the column-by-column 1DFFT of thematrix119861 we only have to compute the FFT of the first vectorand then use the appropriately scaled vector ^ to derivethe remainder of the columns The row-by-row FFT has

to be calculated in a row burst normal way Algorithm 2presents a summary of the shortcut for calculating (119904 119905)The steps presented in Algorithm 2 can replace step Cin Algorithm 1 By reducing column-by-column 1D FFTcomputations for the boundary image this method cansignificantly reduce the number of 1D FFT invocationsreduce the overall DRAM access and eliminate problematiccolumn-wise strided DRAM access for an efficient FPGA-based implementation For column-wise operations a single1D FFT of size 119898 is required rather than 119899119898 1D FFTs of size119898 Moreover since one has to simply store one column ofdata it can be stored on the on-chip local memory (BRAMor SRAM) This can be implemented by temporarily storingthe initial vector sdot1 and scaling factors 1198871푗 in the blockRAMregister memory drastically reducing DRAM accessand lowering the number of required 1D FFT invocationsPerformance evaluation for this has been presented in theresults section

Table 1 shows a comparison of mirroring PSD andour proposed OPSD with respect to DRAM access pointsMirroring has been used for comparison purposes because itis an alternative technique that reduces edge artifacts whilemaintaining maximum amplitude information Howeverdue to replication of the imagemost of the phase informationis lost Figure 5 graphically shows that our OPSDmethod sig-nificantly reduces reading fromexternalmemory and reduces

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 4: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

4 International Journal of Reconfigurable Computing

DRAM Module

Bank 1 Bank 2

Bank 3 Bank s

Bank 1 Bank 2

Bank 3 Bank s

Bank 1 Bank 2

Bank 3 Bank s

Rank 1Rank 0

Chip 0

Chip 1

Chip p

IO

Bus

(a) DRAM hierarchy

DRAM BANK

Column AddressDecoder

Data Array

IO

Row

Add

ress

Row Buffer

Dec

oder

Address Column Address

Row

Addr

ess

(b) Single DRAM bank

DecodeRow

RowPrecharge

ActivateRow

DecodeColumn

Row Buffer Miss

ExtraLatency

Row Buffer Hit

(c) Row buffer hit and miss

Figure 2 (a) An overview of the DRAM hierarchy (b) Image showing the structure of a single DRAM bank (c) Flow chart explainingadditional latency introduced when a new row has to be referred to in the row buffer to access a specific element

However both 119862푟 and 119862푤 can be concealed by interleaving(switching between banks) Since row-wise access is trivialthe row-by-row 1D FFT part of RCD-based 2D FFT is easilyaccomplished However once the row-by-row 1D FFT isstored in the DRAM in standard row-major order to accessa single column each row of the DRAM has to be accessedinto the row buffer rendering the read process extremelyinefficient This is typically the major bottleneck for high-throughput 2D FFTs (Figure 1(b)) We address this problemby designing a custommemory mapping scheme (Section 5)

23 Edge Artifacts While calculating 2D DFTs it is assumedthat the image is periodic which is usually not the caseThe nonperiodic nature of the image leads to artifacts in theFourier transform usually known as edge artifacts or seriestermination errors These artifacts appear as several crossesof high-amplitude coefficients in the frequency domain(Figure 3(b)) Such edge artifacts can be passed to subsequentstages of processing and in biomedical applications theymaylead to critical misinterpretations of results No current 2DFFT FPGA implementation addresses this problem directlyThese artifacts may be removed during preprocessing usingmirroring windowing zero padding or postprocessing eg

filtering techniques These techniques are usually computa-tionally intensive involve an increase in image size and alsotend to modify the transform

The most common approach is by ramping the imageat corner pixels to slowly attenuate the edges Ramping isusually accomplished by an apodization function such asa Tukey (tapered cosine) or a Hamming window whichsmoothly reduces the intensity to zero Such an approach canbe implemented on an FPGA as a preprocessing operationby storing the window function in a look-up table (LUT)and multiplying it with the image stream before calculatingthe FFT [16] Although this approach is not extremelycomputationally intensive for small images it inadvertentlyremoves necessary information from the image Loss of thisinformation may have serious consequences if the imageis being further processed with several other images toreconstruct a final image that is used for diagnostics or otherdecision-critical applications Another common method isby mirroring the image from 119873 times 119873 to 2119873 times 2119873 Doingso makes the image periodic and reduces edge artifactsHowever this not only increases the size of the image by 4timesbut also makes the transform symmetric which generates aninaccurate phase component

International Journal of Reconfigurable Computing 5

(a) (b)

(c) (d)

(e) (f)

Figure 3 (1a) An image with nonperiodic boundary (1b) 2D DFTof (1a) (1c) DFT of the smooth component ie the removedartifacts from (1a) (1d) Periodic component ie DFT of (1a) withedge artifacts removed (1e) Reconstructed smooth component (1f)Reconstructed periodic component

Simultaneously removing the edge artifacts while cal-culating a 2D FFT imposes an additional design challengeregardless of the method used However these artifacts mustbe removed in applications where they may be propagated tosubsequent processing levels An ideal method for removingthese artifacts should involve making the image periodicwhile removing minimal information from the image Peri-odic plus smooth decomposition (PSD) first presented byMoisan [4] and used in [23ndash25] is an ideal method forremoving edge artifacts (specifically for biomedical appli-cations) because it does not directly intervene with pixelsbeside those of the boundary and does not increase imagesizeMoreover its inherently parallel naturemakes it ideal fora high-throughput FPGA-based implementation We havefurther optimized the original PSD decomposition algorithmto make the overall implementation much more efficient by

decreasing the number of required 1D FFT invocations andby reducing external DRAM utilization (Section 4)

24 LabVIEW FPGA High-Level Design Environment Amajor concern while designing complex image processinghardware accelerators is how to fully harness on the divide-and-conquer approach Algorithms that have to be mappedto multiple FPGAs are often marred by communicationproblems and custom FPGA boards reduce flexibility forlarge-scale and evolving designs For rapid prototyping ofour algorithms we used LabVIEW FPGA 2016 (NationalInstruments) a robust data-flow-based graphical designenvironment LabVIEW FPGA provides integration withNational Instruments (NI) Xilinx-based reconfigurable hard-ware allowing efficient communication with a host PC andhigh-throughput communication between multiple FPGAsthrough a PXIe (PCI eXtensions for Industry Express) busLabVIEW FPGA also enables us to integrate external Hard-ware Description Language (HDL) code and gives us theflexibility to expand our design for future processing stagesWe used NI PXIe-7976R FPGA boards that have a XilinxKintex 7 FPGA and 2GB high-bandwidth (10GBs) externalmemory This platform has already been extensively used forrapid prototyping of communication standards and protocolsbefore moving to ASIC designs The optimizations anddesigns we present here are scalable to most reconfigurablecomputing-based systems Moreover LabVIEW FPGA pro-vides efficient high-level control over memory via a smartmemory controller

3 Periodic Plus Smooth Decomposition (PSD)for Edge Artifact Removal (EAR)

Periodic plus smooth decomposition (PSD) involves decom-posing the image into a periodic and smooth componentto remove edge artifacts with minimal loss of informationfrom the image [4] This section presents an overview ofthe PSD algorithm and profiles the algorithm for possibleparallelization and optimization to achieve efficient FPGAimplementation

Let us have discrete 119899 by 119898 gray-scale image 119868 on a finitedomain Ω = 0 1 119899 minus 1 times 0 1 119898 minus 1 The discreteFourier transform (DFT) of 119868 is defined as

(119904 119905) = sum(푖푗)isinΩ

119868 (119894 119895) exp(minus1205802120587 (119904119894119899 + 119905119895119898)) (1)

This is equivalent to a matrix multiplication119882119868119881 where

119882 = (((((((

1 1 1 11 119908 1199082 119908푛minus11 1199082 1199084 1199082(푛minus1) 1 119908푛minus2 1199082(푛minus2) 119908(푛minus2)(푛minus1)1 119908푛minus1 1199082(푛minus1) 119908(푛minus1)(푛minus1)

)))))))

(2)

6 International Journal of Reconfigurable Computing

Step B Step C Step D

Step EStep A

DMAFIFODMA

FIFO

DMAFIFO

IO

Host PCFPGA 1

FPGA 2

SmartImagingDevice Host PC

2D FFT(RCD with DRAM

IntermediateStorage)

2D FFT(RCD with DRAM

IntermediateStorage)

ComputePeriodicBoarder

ComputeSmooth

Component

ComputePeriodic

Component

Figure 4 A top-level architecture for OPSD using two FPGAs and a host PC connected over a high-bandwidth bus The steps are associatedwith Algorithm 1

and

119908푘 = exp(minus1198942120587119899 )푘 = exp(minus1198942120587119896119899 ) (3)

119881 has the same structure as119882 but is m-dimensional 119908푘 hasperiod 119899 which means that 119908푘 = 119908푘+푙푛 forall119896 119897 isin N therefore

119882 = ((((((

1 1 1 1 11 119908 1199082 119908푛minus2 119908푛minus11 1199082 1199084 119908푛minus4 119908푛minus2 1 119908푛minus2 119908푛minus4 1199084 11990821 119908푛minus1 119908푛minus2 1199082 1199081))))))

(4)

Since in general 119868 is not (119899119898)-periodic there will behigh-amplitude edge artifacts present in the DFT stemmingfrom sharp discontinuities between the opposing edges ofthe image as shown in Figure 3(b) Reference [4] proposeda decomposition of 119868 into a periodic component 119875 whichis periodic and captures the essence of the image with allhigh-frequency details and a smoothly varying background119878 which recreates the discontinuities at the borders so 119868 =119875+ 119878 Periodic plus smooth decomposition can be computedby first constructing a border image 119861 = 119877 + 119862 where 119877represents the boundary discontinuities when transitioningrow-wise and 119862when going column-wise

119877 (119894 119895)= 119868 (119899 minus 1 minus 119894 119895) minus 119868 (119894 119895) 119894 = 0 or 119894 = 119899 minus 10 otherwise

119862 (119894 119895)= 119868 (119894 119898 minus 1 minus 119895) minus 119868 (119894 119895) 119895 = 0 or 119895 = 119898 minus 10 otherwise

(5)

It is obvious that the structure of the border image119861 is simplewith nonzero values only in the edges as shown below

119861 = 119877 + 119862= (((

11988711 11988712 1198871푚minus1 1198871푚11988721 0 0 minus11988721 119887푛minus11 0 0 minus119887푛minus11119887푛1 minus11988712 minus1198871푚minus1 minus119887푛푚)))

(6)

The DFT of the smooth component 119878 can be then found bythe following formula

(119904 119905) = (119904 119905)2 cos (2120587119904119899) + 2 cos (2120587119905119898) minus 4 forall (119904 119905) isin Ω (0 0) (7)

The DFT of the image 119868 with edge artifacts removed isthen = minus Figures 3(c) and 3(d) show the DFT ofthe smooth and periodic components respectively Figures3(e) and 3(f) show the reconstructed periodic and smoothcomponents On reconstruction it is evident that there isnegligible visual difference between the actual image and theperiodic reconstructed image for this example

31 Profiling PSD for FPGA Implementation Algorithm 1summarizes the overall PSD implementation There areseveral ways of arranging the algorithm We have arrangedit so that DFTs of the periodic and smooth componentsare readily available for further processing stages For bestresults both the periodic and smooth components shouldundergo similar processing stages and should be added backtogether before displaying the result However dependingon the application it might be acceptable to discard theperiodic component completely For an 119899 times 119898 image stepsA and C have a complexity of O(119899119898 log(119899119898)) and steps Band D have complexity O(119898 + 119899) and O(119898119899) respectivelyComputationally the performance of PSD is limited by stepsA and C Figure 4 shows a proposed top-level architecture

International Journal of Reconfigurable Computing 7

where step A and steps B C and D are completed on separateFPGAs while step E can be done on the host PC Two highend FPGAs are used instead of one because resources onone FPGA are insufficient to compute a large size 2D FFTas well its edge artifact removal components The overallperformance may be limited by FPGA 2 where most of theserial part of the algorithm lies There are two major factorswhich limit the throughput of such a design

(1) While FPGA 1 and FPGA 2 can run in parallel theresult of step A from FPGA 1 has to be stored on thehost PC while steps B C and D are completed onFPGA 2 before step E can be completed on the hostPC

(2) The DRAM intermediate storage problem explainedin Section 22 and Figures 1 and 2 has to be addressedsince strided access to the DRAM for column-wiseoperations can significantly limit throughput

As for (1) it has been addressed in the next section wherewe make use of the inherent symmetry of the boundaryimage to reduce the time required to compute the 2D FFTof the boundary image As for (2) it has been addressed bydesigning a semi-custommemory mapping controller whichrdquotilesrdquo the DRAM floor and rdquohopsrdquo between several tiles so asto minimize strided memory access

4 Proposed Optimized Periodic Plus SmoothDecomposition (OPSD)

In this section we optimize the original PSD algorithmThisoptimization is to effectively reduce the number of 1D FFTinvocations and the number of times the DRAM is accessed

Equation (6) shows that except the corners the boundaryimage 119861 is symmetrical in the sense that boundary rows andcolumns are algebraic negation of each other In total119861has 119899+119898 minus 1 unique elements with the following relations betweencorners with respect to columns and rows

11988711 = 11990311 + 119888111198871푚 = 1199031푚 minus 11988811119887푛1 = minus11990311 + 119888푛1119887푛푚 = minus1199031푚 minus 119888푛1(8)

997904rArr 119887푛푚 = minus11988711 minus 1198871푚 minus 119887푛1 (9)

In computing the FFTof119861 one normally proceeds by firstrunning 1D FFTs column-by-column and then 1D FFTs row-by-row or vice versa An FFT of a column vector 119907with length119899 is119882119907 where119882 is given in (4)The column-wise FFT of thematrix 119861 is then

=119882119861 (10)

Let us have a closer look at the first column denoted by 119861sdot1The 1D FFT of this vector is

sdot1 =119882119861sdot1

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

119887111198872111988731 119887푛minus11119887푛1

))))))

=((((((((((((((((

푛sum푖=1

119887푖1푛sum푖=1

119887푖1119908푖minus1푛sum푖=1

119887푖11199082(푖minus1) 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1)푛sum푖=1

119887푖1119908(푛minus1)(푖minus1)

))))))))))))))))

(11)

It can be shown that the 1D FFT of the column 119895 isin2 3 119898 minus 1 issdot119895 =119882119861sdot119895

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

1198871푗00 0minus1198871푗

))))))

= 1198871푗((((((

01 minus 119908푛minus11 minus 119908푛minus2 1 minus 11990821 minus 119908

))))))

= 1198871푗^

(12)

8 International Journal of Reconfigurable Computing

Input 119868(119894 119895) of size 119899 times 119898Output (119904 119905) (119904 119905)

Step A Compute the 2D DFT of image 119868(119894 119895)1 119868(119894 119895) F997888rarr (119904 119905)Step B Compute periodic border 119861

2 while 1 lt 119895 lt 119898 do3 while 1 lt 119894 lt 119899 do4 if (119894 = 0 or 119894 = 119899 minus 1) then5 119877(119894 119895) larr997888 119868(119899 minus 1 minus 119894 119895) minus 119868(119894 119895)6 else7 119877(119894 119895) larr997888 08 end if9 if (119895 = 0 or 119895 = 119898 minus 1) then10 119862(119894 119895) larr997888 119868(119894 119898 minus 1 minus 119895) minus 119868(119894 119895)11 else12 119862(119894 119895) larr997888 013 end if14 end while15 end while16 119861larr997888 119877 + 119862Step C Compute the 2D DFT of (119904 119905)

17 119861(119894 119895) F997888rarr (119904 119905)Step D Compute the Smooth Component (119904 119905)

18 (119904 119905) larr997888 2 cos(2120587119904119899) + 2 cos(2120587119905119898) minus 419 (119904 119905) larr997888 (119904 119905) divide (119904 119905)

Step E Compute the Periodic Component (119904 119905)20 (119904 119905) larr997888 (119904 119905) minus (119904 119905)21 return (119904 119905) (119904 119905)

Algorithm 1 Periodic plus smooth decomposition (PSD)

and the 1D FFT of the last column 119861sdot119898 is

sdot119898 =119882119861sdot119898 (13)

=((((((((((((((((

minus 푛sum푖=1

119887푖1minus 푛sum푖=1

119887푖1119908푖minus1 + (11988711 + 1198871푚) (1 minus 119908푛minus1)minus 푛sum푖=1

119887푖11199082(푖minus1) + (11988711 + 1198871푚) (1 minus 1199082(푛minus1)) minus 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus2)(푛minus1))minus 푛sum푖=1

119887푖1119908(푛minus1)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus1)(푛minus1))

))))))))))))))))

(14)

sdot119898 = minussdot1 + (11988711 + 1198871푚) ^ (15)

Therefore the column-wise FFT of the matrix 119861 is

= (119861sdot1 11988712^ 1198871푚minus1^ minussdot1 + (11988711 + 1198871푚) ^) (16)

To compute the column-by-column 1DFFT of thematrix119861 we only have to compute the FFT of the first vectorand then use the appropriately scaled vector ^ to derivethe remainder of the columns The row-by-row FFT has

to be calculated in a row burst normal way Algorithm 2presents a summary of the shortcut for calculating (119904 119905)The steps presented in Algorithm 2 can replace step Cin Algorithm 1 By reducing column-by-column 1D FFTcomputations for the boundary image this method cansignificantly reduce the number of 1D FFT invocationsreduce the overall DRAM access and eliminate problematiccolumn-wise strided DRAM access for an efficient FPGA-based implementation For column-wise operations a single1D FFT of size 119898 is required rather than 119899119898 1D FFTs of size119898 Moreover since one has to simply store one column ofdata it can be stored on the on-chip local memory (BRAMor SRAM) This can be implemented by temporarily storingthe initial vector sdot1 and scaling factors 1198871푗 in the blockRAMregister memory drastically reducing DRAM accessand lowering the number of required 1D FFT invocationsPerformance evaluation for this has been presented in theresults section

Table 1 shows a comparison of mirroring PSD andour proposed OPSD with respect to DRAM access pointsMirroring has been used for comparison purposes because itis an alternative technique that reduces edge artifacts whilemaintaining maximum amplitude information Howeverdue to replication of the imagemost of the phase informationis lost Figure 5 graphically shows that our OPSDmethod sig-nificantly reduces reading fromexternalmemory and reduces

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 5: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of Reconfigurable Computing 5

(a) (b)

(c) (d)

(e) (f)

Figure 3 (1a) An image with nonperiodic boundary (1b) 2D DFTof (1a) (1c) DFT of the smooth component ie the removedartifacts from (1a) (1d) Periodic component ie DFT of (1a) withedge artifacts removed (1e) Reconstructed smooth component (1f)Reconstructed periodic component

Simultaneously removing the edge artifacts while cal-culating a 2D FFT imposes an additional design challengeregardless of the method used However these artifacts mustbe removed in applications where they may be propagated tosubsequent processing levels An ideal method for removingthese artifacts should involve making the image periodicwhile removing minimal information from the image Peri-odic plus smooth decomposition (PSD) first presented byMoisan [4] and used in [23ndash25] is an ideal method forremoving edge artifacts (specifically for biomedical appli-cations) because it does not directly intervene with pixelsbeside those of the boundary and does not increase imagesizeMoreover its inherently parallel naturemakes it ideal fora high-throughput FPGA-based implementation We havefurther optimized the original PSD decomposition algorithmto make the overall implementation much more efficient by

decreasing the number of required 1D FFT invocations andby reducing external DRAM utilization (Section 4)

24 LabVIEW FPGA High-Level Design Environment Amajor concern while designing complex image processinghardware accelerators is how to fully harness on the divide-and-conquer approach Algorithms that have to be mappedto multiple FPGAs are often marred by communicationproblems and custom FPGA boards reduce flexibility forlarge-scale and evolving designs For rapid prototyping ofour algorithms we used LabVIEW FPGA 2016 (NationalInstruments) a robust data-flow-based graphical designenvironment LabVIEW FPGA provides integration withNational Instruments (NI) Xilinx-based reconfigurable hard-ware allowing efficient communication with a host PC andhigh-throughput communication between multiple FPGAsthrough a PXIe (PCI eXtensions for Industry Express) busLabVIEW FPGA also enables us to integrate external Hard-ware Description Language (HDL) code and gives us theflexibility to expand our design for future processing stagesWe used NI PXIe-7976R FPGA boards that have a XilinxKintex 7 FPGA and 2GB high-bandwidth (10GBs) externalmemory This platform has already been extensively used forrapid prototyping of communication standards and protocolsbefore moving to ASIC designs The optimizations anddesigns we present here are scalable to most reconfigurablecomputing-based systems Moreover LabVIEW FPGA pro-vides efficient high-level control over memory via a smartmemory controller

3 Periodic Plus Smooth Decomposition (PSD)for Edge Artifact Removal (EAR)

Periodic plus smooth decomposition (PSD) involves decom-posing the image into a periodic and smooth componentto remove edge artifacts with minimal loss of informationfrom the image [4] This section presents an overview ofthe PSD algorithm and profiles the algorithm for possibleparallelization and optimization to achieve efficient FPGAimplementation

Let us have discrete 119899 by 119898 gray-scale image 119868 on a finitedomain Ω = 0 1 119899 minus 1 times 0 1 119898 minus 1 The discreteFourier transform (DFT) of 119868 is defined as

(119904 119905) = sum(푖푗)isinΩ

119868 (119894 119895) exp(minus1205802120587 (119904119894119899 + 119905119895119898)) (1)

This is equivalent to a matrix multiplication119882119868119881 where

119882 = (((((((

1 1 1 11 119908 1199082 119908푛minus11 1199082 1199084 1199082(푛minus1) 1 119908푛minus2 1199082(푛minus2) 119908(푛minus2)(푛minus1)1 119908푛minus1 1199082(푛minus1) 119908(푛minus1)(푛minus1)

)))))))

(2)

6 International Journal of Reconfigurable Computing

Step B Step C Step D

Step EStep A

DMAFIFODMA

FIFO

DMAFIFO

IO

Host PCFPGA 1

FPGA 2

SmartImagingDevice Host PC

2D FFT(RCD with DRAM

IntermediateStorage)

2D FFT(RCD with DRAM

IntermediateStorage)

ComputePeriodicBoarder

ComputeSmooth

Component

ComputePeriodic

Component

Figure 4 A top-level architecture for OPSD using two FPGAs and a host PC connected over a high-bandwidth bus The steps are associatedwith Algorithm 1

and

119908푘 = exp(minus1198942120587119899 )푘 = exp(minus1198942120587119896119899 ) (3)

119881 has the same structure as119882 but is m-dimensional 119908푘 hasperiod 119899 which means that 119908푘 = 119908푘+푙푛 forall119896 119897 isin N therefore

119882 = ((((((

1 1 1 1 11 119908 1199082 119908푛minus2 119908푛minus11 1199082 1199084 119908푛minus4 119908푛minus2 1 119908푛minus2 119908푛minus4 1199084 11990821 119908푛minus1 119908푛minus2 1199082 1199081))))))

(4)

Since in general 119868 is not (119899119898)-periodic there will behigh-amplitude edge artifacts present in the DFT stemmingfrom sharp discontinuities between the opposing edges ofthe image as shown in Figure 3(b) Reference [4] proposeda decomposition of 119868 into a periodic component 119875 whichis periodic and captures the essence of the image with allhigh-frequency details and a smoothly varying background119878 which recreates the discontinuities at the borders so 119868 =119875+ 119878 Periodic plus smooth decomposition can be computedby first constructing a border image 119861 = 119877 + 119862 where 119877represents the boundary discontinuities when transitioningrow-wise and 119862when going column-wise

119877 (119894 119895)= 119868 (119899 minus 1 minus 119894 119895) minus 119868 (119894 119895) 119894 = 0 or 119894 = 119899 minus 10 otherwise

119862 (119894 119895)= 119868 (119894 119898 minus 1 minus 119895) minus 119868 (119894 119895) 119895 = 0 or 119895 = 119898 minus 10 otherwise

(5)

It is obvious that the structure of the border image119861 is simplewith nonzero values only in the edges as shown below

119861 = 119877 + 119862= (((

11988711 11988712 1198871푚minus1 1198871푚11988721 0 0 minus11988721 119887푛minus11 0 0 minus119887푛minus11119887푛1 minus11988712 minus1198871푚minus1 minus119887푛푚)))

(6)

The DFT of the smooth component 119878 can be then found bythe following formula

(119904 119905) = (119904 119905)2 cos (2120587119904119899) + 2 cos (2120587119905119898) minus 4 forall (119904 119905) isin Ω (0 0) (7)

The DFT of the image 119868 with edge artifacts removed isthen = minus Figures 3(c) and 3(d) show the DFT ofthe smooth and periodic components respectively Figures3(e) and 3(f) show the reconstructed periodic and smoothcomponents On reconstruction it is evident that there isnegligible visual difference between the actual image and theperiodic reconstructed image for this example

31 Profiling PSD for FPGA Implementation Algorithm 1summarizes the overall PSD implementation There areseveral ways of arranging the algorithm We have arrangedit so that DFTs of the periodic and smooth componentsare readily available for further processing stages For bestresults both the periodic and smooth components shouldundergo similar processing stages and should be added backtogether before displaying the result However dependingon the application it might be acceptable to discard theperiodic component completely For an 119899 times 119898 image stepsA and C have a complexity of O(119899119898 log(119899119898)) and steps Band D have complexity O(119898 + 119899) and O(119898119899) respectivelyComputationally the performance of PSD is limited by stepsA and C Figure 4 shows a proposed top-level architecture

International Journal of Reconfigurable Computing 7

where step A and steps B C and D are completed on separateFPGAs while step E can be done on the host PC Two highend FPGAs are used instead of one because resources onone FPGA are insufficient to compute a large size 2D FFTas well its edge artifact removal components The overallperformance may be limited by FPGA 2 where most of theserial part of the algorithm lies There are two major factorswhich limit the throughput of such a design

(1) While FPGA 1 and FPGA 2 can run in parallel theresult of step A from FPGA 1 has to be stored on thehost PC while steps B C and D are completed onFPGA 2 before step E can be completed on the hostPC

(2) The DRAM intermediate storage problem explainedin Section 22 and Figures 1 and 2 has to be addressedsince strided access to the DRAM for column-wiseoperations can significantly limit throughput

As for (1) it has been addressed in the next section wherewe make use of the inherent symmetry of the boundaryimage to reduce the time required to compute the 2D FFTof the boundary image As for (2) it has been addressed bydesigning a semi-custommemory mapping controller whichrdquotilesrdquo the DRAM floor and rdquohopsrdquo between several tiles so asto minimize strided memory access

4 Proposed Optimized Periodic Plus SmoothDecomposition (OPSD)

In this section we optimize the original PSD algorithmThisoptimization is to effectively reduce the number of 1D FFTinvocations and the number of times the DRAM is accessed

Equation (6) shows that except the corners the boundaryimage 119861 is symmetrical in the sense that boundary rows andcolumns are algebraic negation of each other In total119861has 119899+119898 minus 1 unique elements with the following relations betweencorners with respect to columns and rows

11988711 = 11990311 + 119888111198871푚 = 1199031푚 minus 11988811119887푛1 = minus11990311 + 119888푛1119887푛푚 = minus1199031푚 minus 119888푛1(8)

997904rArr 119887푛푚 = minus11988711 minus 1198871푚 minus 119887푛1 (9)

In computing the FFTof119861 one normally proceeds by firstrunning 1D FFTs column-by-column and then 1D FFTs row-by-row or vice versa An FFT of a column vector 119907with length119899 is119882119907 where119882 is given in (4)The column-wise FFT of thematrix 119861 is then

=119882119861 (10)

Let us have a closer look at the first column denoted by 119861sdot1The 1D FFT of this vector is

sdot1 =119882119861sdot1

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

119887111198872111988731 119887푛minus11119887푛1

))))))

=((((((((((((((((

푛sum푖=1

119887푖1푛sum푖=1

119887푖1119908푖minus1푛sum푖=1

119887푖11199082(푖minus1) 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1)푛sum푖=1

119887푖1119908(푛minus1)(푖minus1)

))))))))))))))))

(11)

It can be shown that the 1D FFT of the column 119895 isin2 3 119898 minus 1 issdot119895 =119882119861sdot119895

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

1198871푗00 0minus1198871푗

))))))

= 1198871푗((((((

01 minus 119908푛minus11 minus 119908푛minus2 1 minus 11990821 minus 119908

))))))

= 1198871푗^

(12)

8 International Journal of Reconfigurable Computing

Input 119868(119894 119895) of size 119899 times 119898Output (119904 119905) (119904 119905)

Step A Compute the 2D DFT of image 119868(119894 119895)1 119868(119894 119895) F997888rarr (119904 119905)Step B Compute periodic border 119861

2 while 1 lt 119895 lt 119898 do3 while 1 lt 119894 lt 119899 do4 if (119894 = 0 or 119894 = 119899 minus 1) then5 119877(119894 119895) larr997888 119868(119899 minus 1 minus 119894 119895) minus 119868(119894 119895)6 else7 119877(119894 119895) larr997888 08 end if9 if (119895 = 0 or 119895 = 119898 minus 1) then10 119862(119894 119895) larr997888 119868(119894 119898 minus 1 minus 119895) minus 119868(119894 119895)11 else12 119862(119894 119895) larr997888 013 end if14 end while15 end while16 119861larr997888 119877 + 119862Step C Compute the 2D DFT of (119904 119905)

17 119861(119894 119895) F997888rarr (119904 119905)Step D Compute the Smooth Component (119904 119905)

18 (119904 119905) larr997888 2 cos(2120587119904119899) + 2 cos(2120587119905119898) minus 419 (119904 119905) larr997888 (119904 119905) divide (119904 119905)

Step E Compute the Periodic Component (119904 119905)20 (119904 119905) larr997888 (119904 119905) minus (119904 119905)21 return (119904 119905) (119904 119905)

Algorithm 1 Periodic plus smooth decomposition (PSD)

and the 1D FFT of the last column 119861sdot119898 is

sdot119898 =119882119861sdot119898 (13)

=((((((((((((((((

minus 푛sum푖=1

119887푖1minus 푛sum푖=1

119887푖1119908푖minus1 + (11988711 + 1198871푚) (1 minus 119908푛minus1)minus 푛sum푖=1

119887푖11199082(푖minus1) + (11988711 + 1198871푚) (1 minus 1199082(푛minus1)) minus 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus2)(푛minus1))minus 푛sum푖=1

119887푖1119908(푛minus1)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus1)(푛minus1))

))))))))))))))))

(14)

sdot119898 = minussdot1 + (11988711 + 1198871푚) ^ (15)

Therefore the column-wise FFT of the matrix 119861 is

= (119861sdot1 11988712^ 1198871푚minus1^ minussdot1 + (11988711 + 1198871푚) ^) (16)

To compute the column-by-column 1DFFT of thematrix119861 we only have to compute the FFT of the first vectorand then use the appropriately scaled vector ^ to derivethe remainder of the columns The row-by-row FFT has

to be calculated in a row burst normal way Algorithm 2presents a summary of the shortcut for calculating (119904 119905)The steps presented in Algorithm 2 can replace step Cin Algorithm 1 By reducing column-by-column 1D FFTcomputations for the boundary image this method cansignificantly reduce the number of 1D FFT invocationsreduce the overall DRAM access and eliminate problematiccolumn-wise strided DRAM access for an efficient FPGA-based implementation For column-wise operations a single1D FFT of size 119898 is required rather than 119899119898 1D FFTs of size119898 Moreover since one has to simply store one column ofdata it can be stored on the on-chip local memory (BRAMor SRAM) This can be implemented by temporarily storingthe initial vector sdot1 and scaling factors 1198871푗 in the blockRAMregister memory drastically reducing DRAM accessand lowering the number of required 1D FFT invocationsPerformance evaluation for this has been presented in theresults section

Table 1 shows a comparison of mirroring PSD andour proposed OPSD with respect to DRAM access pointsMirroring has been used for comparison purposes because itis an alternative technique that reduces edge artifacts whilemaintaining maximum amplitude information Howeverdue to replication of the imagemost of the phase informationis lost Figure 5 graphically shows that our OPSDmethod sig-nificantly reduces reading fromexternalmemory and reduces

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 6: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

6 International Journal of Reconfigurable Computing

Step B Step C Step D

Step EStep A

DMAFIFODMA

FIFO

DMAFIFO

IO

Host PCFPGA 1

FPGA 2

SmartImagingDevice Host PC

2D FFT(RCD with DRAM

IntermediateStorage)

2D FFT(RCD with DRAM

IntermediateStorage)

ComputePeriodicBoarder

ComputeSmooth

Component

ComputePeriodic

Component

Figure 4 A top-level architecture for OPSD using two FPGAs and a host PC connected over a high-bandwidth bus The steps are associatedwith Algorithm 1

and

119908푘 = exp(minus1198942120587119899 )푘 = exp(minus1198942120587119896119899 ) (3)

119881 has the same structure as119882 but is m-dimensional 119908푘 hasperiod 119899 which means that 119908푘 = 119908푘+푙푛 forall119896 119897 isin N therefore

119882 = ((((((

1 1 1 1 11 119908 1199082 119908푛minus2 119908푛minus11 1199082 1199084 119908푛minus4 119908푛minus2 1 119908푛minus2 119908푛minus4 1199084 11990821 119908푛minus1 119908푛minus2 1199082 1199081))))))

(4)

Since in general 119868 is not (119899119898)-periodic there will behigh-amplitude edge artifacts present in the DFT stemmingfrom sharp discontinuities between the opposing edges ofthe image as shown in Figure 3(b) Reference [4] proposeda decomposition of 119868 into a periodic component 119875 whichis periodic and captures the essence of the image with allhigh-frequency details and a smoothly varying background119878 which recreates the discontinuities at the borders so 119868 =119875+ 119878 Periodic plus smooth decomposition can be computedby first constructing a border image 119861 = 119877 + 119862 where 119877represents the boundary discontinuities when transitioningrow-wise and 119862when going column-wise

119877 (119894 119895)= 119868 (119899 minus 1 minus 119894 119895) minus 119868 (119894 119895) 119894 = 0 or 119894 = 119899 minus 10 otherwise

119862 (119894 119895)= 119868 (119894 119898 minus 1 minus 119895) minus 119868 (119894 119895) 119895 = 0 or 119895 = 119898 minus 10 otherwise

(5)

It is obvious that the structure of the border image119861 is simplewith nonzero values only in the edges as shown below

119861 = 119877 + 119862= (((

11988711 11988712 1198871푚minus1 1198871푚11988721 0 0 minus11988721 119887푛minus11 0 0 minus119887푛minus11119887푛1 minus11988712 minus1198871푚minus1 minus119887푛푚)))

(6)

The DFT of the smooth component 119878 can be then found bythe following formula

(119904 119905) = (119904 119905)2 cos (2120587119904119899) + 2 cos (2120587119905119898) minus 4 forall (119904 119905) isin Ω (0 0) (7)

The DFT of the image 119868 with edge artifacts removed isthen = minus Figures 3(c) and 3(d) show the DFT ofthe smooth and periodic components respectively Figures3(e) and 3(f) show the reconstructed periodic and smoothcomponents On reconstruction it is evident that there isnegligible visual difference between the actual image and theperiodic reconstructed image for this example

31 Profiling PSD for FPGA Implementation Algorithm 1summarizes the overall PSD implementation There areseveral ways of arranging the algorithm We have arrangedit so that DFTs of the periodic and smooth componentsare readily available for further processing stages For bestresults both the periodic and smooth components shouldundergo similar processing stages and should be added backtogether before displaying the result However dependingon the application it might be acceptable to discard theperiodic component completely For an 119899 times 119898 image stepsA and C have a complexity of O(119899119898 log(119899119898)) and steps Band D have complexity O(119898 + 119899) and O(119898119899) respectivelyComputationally the performance of PSD is limited by stepsA and C Figure 4 shows a proposed top-level architecture

International Journal of Reconfigurable Computing 7

where step A and steps B C and D are completed on separateFPGAs while step E can be done on the host PC Two highend FPGAs are used instead of one because resources onone FPGA are insufficient to compute a large size 2D FFTas well its edge artifact removal components The overallperformance may be limited by FPGA 2 where most of theserial part of the algorithm lies There are two major factorswhich limit the throughput of such a design

(1) While FPGA 1 and FPGA 2 can run in parallel theresult of step A from FPGA 1 has to be stored on thehost PC while steps B C and D are completed onFPGA 2 before step E can be completed on the hostPC

(2) The DRAM intermediate storage problem explainedin Section 22 and Figures 1 and 2 has to be addressedsince strided access to the DRAM for column-wiseoperations can significantly limit throughput

As for (1) it has been addressed in the next section wherewe make use of the inherent symmetry of the boundaryimage to reduce the time required to compute the 2D FFTof the boundary image As for (2) it has been addressed bydesigning a semi-custommemory mapping controller whichrdquotilesrdquo the DRAM floor and rdquohopsrdquo between several tiles so asto minimize strided memory access

4 Proposed Optimized Periodic Plus SmoothDecomposition (OPSD)

In this section we optimize the original PSD algorithmThisoptimization is to effectively reduce the number of 1D FFTinvocations and the number of times the DRAM is accessed

Equation (6) shows that except the corners the boundaryimage 119861 is symmetrical in the sense that boundary rows andcolumns are algebraic negation of each other In total119861has 119899+119898 minus 1 unique elements with the following relations betweencorners with respect to columns and rows

11988711 = 11990311 + 119888111198871푚 = 1199031푚 minus 11988811119887푛1 = minus11990311 + 119888푛1119887푛푚 = minus1199031푚 minus 119888푛1(8)

997904rArr 119887푛푚 = minus11988711 minus 1198871푚 minus 119887푛1 (9)

In computing the FFTof119861 one normally proceeds by firstrunning 1D FFTs column-by-column and then 1D FFTs row-by-row or vice versa An FFT of a column vector 119907with length119899 is119882119907 where119882 is given in (4)The column-wise FFT of thematrix 119861 is then

=119882119861 (10)

Let us have a closer look at the first column denoted by 119861sdot1The 1D FFT of this vector is

sdot1 =119882119861sdot1

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

119887111198872111988731 119887푛minus11119887푛1

))))))

=((((((((((((((((

푛sum푖=1

119887푖1푛sum푖=1

119887푖1119908푖minus1푛sum푖=1

119887푖11199082(푖minus1) 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1)푛sum푖=1

119887푖1119908(푛minus1)(푖minus1)

))))))))))))))))

(11)

It can be shown that the 1D FFT of the column 119895 isin2 3 119898 minus 1 issdot119895 =119882119861sdot119895

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

1198871푗00 0minus1198871푗

))))))

= 1198871푗((((((

01 minus 119908푛minus11 minus 119908푛minus2 1 minus 11990821 minus 119908

))))))

= 1198871푗^

(12)

8 International Journal of Reconfigurable Computing

Input 119868(119894 119895) of size 119899 times 119898Output (119904 119905) (119904 119905)

Step A Compute the 2D DFT of image 119868(119894 119895)1 119868(119894 119895) F997888rarr (119904 119905)Step B Compute periodic border 119861

2 while 1 lt 119895 lt 119898 do3 while 1 lt 119894 lt 119899 do4 if (119894 = 0 or 119894 = 119899 minus 1) then5 119877(119894 119895) larr997888 119868(119899 minus 1 minus 119894 119895) minus 119868(119894 119895)6 else7 119877(119894 119895) larr997888 08 end if9 if (119895 = 0 or 119895 = 119898 minus 1) then10 119862(119894 119895) larr997888 119868(119894 119898 minus 1 minus 119895) minus 119868(119894 119895)11 else12 119862(119894 119895) larr997888 013 end if14 end while15 end while16 119861larr997888 119877 + 119862Step C Compute the 2D DFT of (119904 119905)

17 119861(119894 119895) F997888rarr (119904 119905)Step D Compute the Smooth Component (119904 119905)

18 (119904 119905) larr997888 2 cos(2120587119904119899) + 2 cos(2120587119905119898) minus 419 (119904 119905) larr997888 (119904 119905) divide (119904 119905)

Step E Compute the Periodic Component (119904 119905)20 (119904 119905) larr997888 (119904 119905) minus (119904 119905)21 return (119904 119905) (119904 119905)

Algorithm 1 Periodic plus smooth decomposition (PSD)

and the 1D FFT of the last column 119861sdot119898 is

sdot119898 =119882119861sdot119898 (13)

=((((((((((((((((

minus 푛sum푖=1

119887푖1minus 푛sum푖=1

119887푖1119908푖minus1 + (11988711 + 1198871푚) (1 minus 119908푛minus1)minus 푛sum푖=1

119887푖11199082(푖minus1) + (11988711 + 1198871푚) (1 minus 1199082(푛minus1)) minus 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus2)(푛minus1))minus 푛sum푖=1

119887푖1119908(푛minus1)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus1)(푛minus1))

))))))))))))))))

(14)

sdot119898 = minussdot1 + (11988711 + 1198871푚) ^ (15)

Therefore the column-wise FFT of the matrix 119861 is

= (119861sdot1 11988712^ 1198871푚minus1^ minussdot1 + (11988711 + 1198871푚) ^) (16)

To compute the column-by-column 1DFFT of thematrix119861 we only have to compute the FFT of the first vectorand then use the appropriately scaled vector ^ to derivethe remainder of the columns The row-by-row FFT has

to be calculated in a row burst normal way Algorithm 2presents a summary of the shortcut for calculating (119904 119905)The steps presented in Algorithm 2 can replace step Cin Algorithm 1 By reducing column-by-column 1D FFTcomputations for the boundary image this method cansignificantly reduce the number of 1D FFT invocationsreduce the overall DRAM access and eliminate problematiccolumn-wise strided DRAM access for an efficient FPGA-based implementation For column-wise operations a single1D FFT of size 119898 is required rather than 119899119898 1D FFTs of size119898 Moreover since one has to simply store one column ofdata it can be stored on the on-chip local memory (BRAMor SRAM) This can be implemented by temporarily storingthe initial vector sdot1 and scaling factors 1198871푗 in the blockRAMregister memory drastically reducing DRAM accessand lowering the number of required 1D FFT invocationsPerformance evaluation for this has been presented in theresults section

Table 1 shows a comparison of mirroring PSD andour proposed OPSD with respect to DRAM access pointsMirroring has been used for comparison purposes because itis an alternative technique that reduces edge artifacts whilemaintaining maximum amplitude information Howeverdue to replication of the imagemost of the phase informationis lost Figure 5 graphically shows that our OPSDmethod sig-nificantly reduces reading fromexternalmemory and reduces

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 7: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of Reconfigurable Computing 7

where step A and steps B C and D are completed on separateFPGAs while step E can be done on the host PC Two highend FPGAs are used instead of one because resources onone FPGA are insufficient to compute a large size 2D FFTas well its edge artifact removal components The overallperformance may be limited by FPGA 2 where most of theserial part of the algorithm lies There are two major factorswhich limit the throughput of such a design

(1) While FPGA 1 and FPGA 2 can run in parallel theresult of step A from FPGA 1 has to be stored on thehost PC while steps B C and D are completed onFPGA 2 before step E can be completed on the hostPC

(2) The DRAM intermediate storage problem explainedin Section 22 and Figures 1 and 2 has to be addressedsince strided access to the DRAM for column-wiseoperations can significantly limit throughput

As for (1) it has been addressed in the next section wherewe make use of the inherent symmetry of the boundaryimage to reduce the time required to compute the 2D FFTof the boundary image As for (2) it has been addressed bydesigning a semi-custommemory mapping controller whichrdquotilesrdquo the DRAM floor and rdquohopsrdquo between several tiles so asto minimize strided memory access

4 Proposed Optimized Periodic Plus SmoothDecomposition (OPSD)

In this section we optimize the original PSD algorithmThisoptimization is to effectively reduce the number of 1D FFTinvocations and the number of times the DRAM is accessed

Equation (6) shows that except the corners the boundaryimage 119861 is symmetrical in the sense that boundary rows andcolumns are algebraic negation of each other In total119861has 119899+119898 minus 1 unique elements with the following relations betweencorners with respect to columns and rows

11988711 = 11990311 + 119888111198871푚 = 1199031푚 minus 11988811119887푛1 = minus11990311 + 119888푛1119887푛푚 = minus1199031푚 minus 119888푛1(8)

997904rArr 119887푛푚 = minus11988711 minus 1198871푚 minus 119887푛1 (9)

In computing the FFTof119861 one normally proceeds by firstrunning 1D FFTs column-by-column and then 1D FFTs row-by-row or vice versa An FFT of a column vector 119907with length119899 is119882119907 where119882 is given in (4)The column-wise FFT of thematrix 119861 is then

=119882119861 (10)

Let us have a closer look at the first column denoted by 119861sdot1The 1D FFT of this vector is

sdot1 =119882119861sdot1

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

119887111198872111988731 119887푛minus11119887푛1

))))))

=((((((((((((((((

푛sum푖=1

119887푖1푛sum푖=1

119887푖1119908푖minus1푛sum푖=1

119887푖11199082(푖minus1) 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1)푛sum푖=1

119887푖1119908(푛minus1)(푖minus1)

))))))))))))))))

(11)

It can be shown that the 1D FFT of the column 119895 isin2 3 119898 minus 1 issdot119895 =119882119861sdot119895

= (((((((

1 1 11 119908 119908푛minus11 1199082 1199082(푛minus1) 1 119908푛minus2 119908(푛minus2)(푛minus1)1 119908푛minus1 119908(푛minus1)(푛minus1)

)))))))

((((((

1198871푗00 0minus1198871푗

))))))

= 1198871푗((((((

01 minus 119908푛minus11 minus 119908푛minus2 1 minus 11990821 minus 119908

))))))

= 1198871푗^

(12)

8 International Journal of Reconfigurable Computing

Input 119868(119894 119895) of size 119899 times 119898Output (119904 119905) (119904 119905)

Step A Compute the 2D DFT of image 119868(119894 119895)1 119868(119894 119895) F997888rarr (119904 119905)Step B Compute periodic border 119861

2 while 1 lt 119895 lt 119898 do3 while 1 lt 119894 lt 119899 do4 if (119894 = 0 or 119894 = 119899 minus 1) then5 119877(119894 119895) larr997888 119868(119899 minus 1 minus 119894 119895) minus 119868(119894 119895)6 else7 119877(119894 119895) larr997888 08 end if9 if (119895 = 0 or 119895 = 119898 minus 1) then10 119862(119894 119895) larr997888 119868(119894 119898 minus 1 minus 119895) minus 119868(119894 119895)11 else12 119862(119894 119895) larr997888 013 end if14 end while15 end while16 119861larr997888 119877 + 119862Step C Compute the 2D DFT of (119904 119905)

17 119861(119894 119895) F997888rarr (119904 119905)Step D Compute the Smooth Component (119904 119905)

18 (119904 119905) larr997888 2 cos(2120587119904119899) + 2 cos(2120587119905119898) minus 419 (119904 119905) larr997888 (119904 119905) divide (119904 119905)

Step E Compute the Periodic Component (119904 119905)20 (119904 119905) larr997888 (119904 119905) minus (119904 119905)21 return (119904 119905) (119904 119905)

Algorithm 1 Periodic plus smooth decomposition (PSD)

and the 1D FFT of the last column 119861sdot119898 is

sdot119898 =119882119861sdot119898 (13)

=((((((((((((((((

minus 푛sum푖=1

119887푖1minus 푛sum푖=1

119887푖1119908푖minus1 + (11988711 + 1198871푚) (1 minus 119908푛minus1)minus 푛sum푖=1

119887푖11199082(푖minus1) + (11988711 + 1198871푚) (1 minus 1199082(푛minus1)) minus 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus2)(푛minus1))minus 푛sum푖=1

119887푖1119908(푛minus1)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus1)(푛minus1))

))))))))))))))))

(14)

sdot119898 = minussdot1 + (11988711 + 1198871푚) ^ (15)

Therefore the column-wise FFT of the matrix 119861 is

= (119861sdot1 11988712^ 1198871푚minus1^ minussdot1 + (11988711 + 1198871푚) ^) (16)

To compute the column-by-column 1DFFT of thematrix119861 we only have to compute the FFT of the first vectorand then use the appropriately scaled vector ^ to derivethe remainder of the columns The row-by-row FFT has

to be calculated in a row burst normal way Algorithm 2presents a summary of the shortcut for calculating (119904 119905)The steps presented in Algorithm 2 can replace step Cin Algorithm 1 By reducing column-by-column 1D FFTcomputations for the boundary image this method cansignificantly reduce the number of 1D FFT invocationsreduce the overall DRAM access and eliminate problematiccolumn-wise strided DRAM access for an efficient FPGA-based implementation For column-wise operations a single1D FFT of size 119898 is required rather than 119899119898 1D FFTs of size119898 Moreover since one has to simply store one column ofdata it can be stored on the on-chip local memory (BRAMor SRAM) This can be implemented by temporarily storingthe initial vector sdot1 and scaling factors 1198871푗 in the blockRAMregister memory drastically reducing DRAM accessand lowering the number of required 1D FFT invocationsPerformance evaluation for this has been presented in theresults section

Table 1 shows a comparison of mirroring PSD andour proposed OPSD with respect to DRAM access pointsMirroring has been used for comparison purposes because itis an alternative technique that reduces edge artifacts whilemaintaining maximum amplitude information Howeverdue to replication of the imagemost of the phase informationis lost Figure 5 graphically shows that our OPSDmethod sig-nificantly reduces reading fromexternalmemory and reduces

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 8: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

8 International Journal of Reconfigurable Computing

Input 119868(119894 119895) of size 119899 times 119898Output (119904 119905) (119904 119905)

Step A Compute the 2D DFT of image 119868(119894 119895)1 119868(119894 119895) F997888rarr (119904 119905)Step B Compute periodic border 119861

2 while 1 lt 119895 lt 119898 do3 while 1 lt 119894 lt 119899 do4 if (119894 = 0 or 119894 = 119899 minus 1) then5 119877(119894 119895) larr997888 119868(119899 minus 1 minus 119894 119895) minus 119868(119894 119895)6 else7 119877(119894 119895) larr997888 08 end if9 if (119895 = 0 or 119895 = 119898 minus 1) then10 119862(119894 119895) larr997888 119868(119894 119898 minus 1 minus 119895) minus 119868(119894 119895)11 else12 119862(119894 119895) larr997888 013 end if14 end while15 end while16 119861larr997888 119877 + 119862Step C Compute the 2D DFT of (119904 119905)

17 119861(119894 119895) F997888rarr (119904 119905)Step D Compute the Smooth Component (119904 119905)

18 (119904 119905) larr997888 2 cos(2120587119904119899) + 2 cos(2120587119905119898) minus 419 (119904 119905) larr997888 (119904 119905) divide (119904 119905)

Step E Compute the Periodic Component (119904 119905)20 (119904 119905) larr997888 (119904 119905) minus (119904 119905)21 return (119904 119905) (119904 119905)

Algorithm 1 Periodic plus smooth decomposition (PSD)

and the 1D FFT of the last column 119861sdot119898 is

sdot119898 =119882119861sdot119898 (13)

=((((((((((((((((

minus 푛sum푖=1

119887푖1minus 푛sum푖=1

119887푖1119908푖minus1 + (11988711 + 1198871푚) (1 minus 119908푛minus1)minus 푛sum푖=1

119887푖11199082(푖minus1) + (11988711 + 1198871푚) (1 minus 1199082(푛minus1)) minus 푛sum푖=1

119887푖1119908(푛minus2)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus2)(푛minus1))minus 푛sum푖=1

119887푖1119908(푛minus1)(푖minus1) + (11988711 + 1198871푚) (1 minus 119908(푛minus1)(푛minus1))

))))))))))))))))

(14)

sdot119898 = minussdot1 + (11988711 + 1198871푚) ^ (15)

Therefore the column-wise FFT of the matrix 119861 is

= (119861sdot1 11988712^ 1198871푚minus1^ minussdot1 + (11988711 + 1198871푚) ^) (16)

To compute the column-by-column 1DFFT of thematrix119861 we only have to compute the FFT of the first vectorand then use the appropriately scaled vector ^ to derivethe remainder of the columns The row-by-row FFT has

to be calculated in a row burst normal way Algorithm 2presents a summary of the shortcut for calculating (119904 119905)The steps presented in Algorithm 2 can replace step Cin Algorithm 1 By reducing column-by-column 1D FFTcomputations for the boundary image this method cansignificantly reduce the number of 1D FFT invocationsreduce the overall DRAM access and eliminate problematiccolumn-wise strided DRAM access for an efficient FPGA-based implementation For column-wise operations a single1D FFT of size 119898 is required rather than 119899119898 1D FFTs of size119898 Moreover since one has to simply store one column ofdata it can be stored on the on-chip local memory (BRAMor SRAM) This can be implemented by temporarily storingthe initial vector sdot1 and scaling factors 1198871푗 in the blockRAMregister memory drastically reducing DRAM accessand lowering the number of required 1D FFT invocationsPerformance evaluation for this has been presented in theresults section

Table 1 shows a comparison of mirroring PSD andour proposed OPSD with respect to DRAM access pointsMirroring has been used for comparison purposes because itis an alternative technique that reduces edge artifacts whilemaintaining maximum amplitude information Howeverdue to replication of the imagemost of the phase informationis lost Figure 5 graphically shows that our OPSDmethod sig-nificantly reduces reading fromexternalmemory and reduces

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 9: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of Reconfigurable Computing 9

Input 119861(119894 119895) of size 119899 times 119898Output (119904 119905)2DF(119861) lArrrArr 119861 F

119888119908997888997888997888rarr 119888119908 F119903119908997888997888997888rarr

Column-by-Column DFT via Symmetrical Short-cut1 119861sdot1

F997888rarr sdot12 while 1 lt 119895 lt 119898 do3 sdot119895 larr997888 1198871푗^4 end while5 sdot119898 larr997888 minussdot1 + (11988711 + 1198871푚)^6 119888119908 larr997888 119862119900119899119888119886119905119890119899119886119905119890 sdot1 sdot2 sdot119898minus1 sdot119898

Row-by-Row DFT7 119888119908

F119903119908997888997888997888rarr (119904 119905)

8 return (119904 119905)cw column-wisecolumn-by-columnrw row-wiserow-by-row

Algorithm 2 Proposed symmetrically optimized computation of (119904 119905)

DRA

M A

cces

s Poi

nts

2

18

16

14

12

1

08

06

04

02

0

Image Size (N=M)0 1000 2000 3000 4000 5000

DRAM Access - PSD vs OPSD

MirroringPSDOPSD (Proposed)

times108

Figure 5 Graph showing DRAM access (equal to number of DFT points to be computed) with increasing image size for mirroring periodicplus smooth decomposition (PSD) and our proposed optimized period plus smooth decomposition (OPSD)

the overall number of DFT computations required It shouldbe noted that such optimization is only possible for eithercolumn-wise or row-wise operations because after either ofthese operations the output is not symmetrical anymoreCompleting the column-wise operation first prevents stridedreading however this results in strided writing to the DRAMbefore row-wise traversal can startThis can be minimized bymaking efficient use of the local block RAM Output columnsare stored in the block RAM before being written to theDRAM in patches such that each row buffer access writeselements in several columns

5 Proposed Tile-Hopping Memory Mappingfor 2D FFTs

In this section we propose a tile-hopping external memoryaccess pattern for efficiently addressing external memory

during intermediate storage between row-wise and column-wise 1D FFT operations to calculate a 2D FFT As explainedin Section 22 and Figure 2 column-wise reads from DRAMcan be costly due to the overhead associated with activatingand precharging In the worst case scenario it can limitDRAM bandwidth up to 80 [26] This is a problem withall such image processing operations where one stage of theprocessing has to be completed on all elements before thenext stage can start In the past there have been severalimplementations using localmemory however with growingdemand for larger image sizes external memory has to beused There have been several DRAM remapping attemptsbefore such as [5 13] They propose a tile-based approachwhere an 119899 times 119899 image (input array) is divided into 119899119896 times 119899119896tiles where 119896 is the size of the DRAM row buffer whichallows for very high-bandwidthDRAMaccess Although this

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 10: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

10 International Journal of Reconfigurable Computing

Table 1 Comparing mirroring PSD and OPSD

Algorithm DRAM Access DFTPoints Points

Mirroring 8119873119872 8119873119872P + S Decomposition (PSD) 4119873119872 4119873119872Optimized PSD (Proposed) 3119873119872 +119873 +119872 minus 1 3119873119872 +119872method may be ideal to maximize the DRAM performancefor 2D FFTs it incurs a high resource cost associated withlocal memory transposition and storing large chunks of data(entire rowcolumn of tiles) in the local memory Moreovertiling in the image domain also requires remapping row-by-row operations Another approach to reducing stridedDRAM access has been presented in [27] They present a2D decomposition algorithmwhich decomposes the probleminto smaller subblock 2D FFTs which can be performedlocallyThis introduces extra row and columndata exchangesand total number of operations are increased fromO(1198992 log 119899)toO(1198992(1+log 119899)) Other implementations do not address theexternal memory issue in detail

We propose tile-hopping address mapping which reducesthe number of row activations required to access a singlecolumn Unlike [5] our approach does not require significantlocal operations or storage The proposed memory mappingcontroller was designed on top of LabVIEW FPGArsquos existingmemory controller which efficiently controls interleaving andissues activation and precharge commands in parallel withdata transfer The reduced number of row activations alsoreduces the amount of energy required by the DRAM Thiswill be further discussed in the energy evaluation presentedin Section 64 and Table 3 where we demonstrate that theproposed tile-hopping address mapping can reduce energyconsumption up to 53

Instead of writing the results of the row-by-row 1D FFTin row-major order we remap the results in a blocked or tiledpattern as shown in Figure 6This means that when accessingan image column several elements of that column can beretrieved from a single DRAM row access For an 119899times119899 imageeach row of size 119899 can be divided into ℎ tiles (ie 119899 = ℎ119873(119905)where119873(119905) is the number of elements in each tile)These tilescan be remapped onto the DRAM floor as shown in Figure 6If the size of the row is small enough it may be possible toconvert it into a single tile (ie ℎ = 1) However this isunlikely for realistic image sizes For a tile of size119901times119902 a singlerow of the image is written into the DRAM by transitioningthrough 119901ℎ rows If 119896 is the size of the row buffer thereare 119896119902 distinct tiles represented in each DRAM row and itcontains the same number of elements from a single imagecolumn Given regular row-major storage when accessingcolumn-wise elements one would have to transition through119899 DRAM rows to read a single image column Howeverwith this approach when accessing an image column 119896119902elements of that column could be read from a single DRAMrow which has been referred to in the row buffer Althoughthe cost of writing an image row is higher when compared toa standard row-major DRAM writing pattern (ie referring

to119901ℎ rather than 119899 rows) the number ofDRAMrow referralsduring column-wise read is reduced to 119899119902119896 which is 119899(1 minus119902119896) less row referrals for a single column read

We refer to this method as tile-hopping because it entailsmapping data onto several DRAM tiles and then hoppingbetween the tiles such that several elements of the imagecolumn exist in a DRAM row which has been referred toin the row bank Although this mapping scheme has beendeveloped for column-wise access required during 2D FFTcalculation the scheme is general and can be adapted to otherapplications Performance evaluation of thismethod has beenpresented in the experiments section

6 Experimental Results and Analysis

61 Hardware Configuration and Target Selection Since 2DDFTs are usually used for simplifying convolution operationsin complex image processing andmachine vision systems weneeded to prototype our design on a system that is expandablefor next levels of processing As mentioned earlier for rapidprototyping of our proposed OPSD algorithm and tile-hopping memory mapping scheme we used a PXIe-basedreconfigurable system PXIe is an industrial extension ofa PCI system with an enhanced bus structure that giveseach connected device dedicated access to the bus with amaximum throughput of 24GBs This allows a high-speeddedicated link between a host PC and several FPGAs TheLabVIEWFPGAgraphical design environment is efficient forrapid prototyping of complicated signal and image processingsystems It allows us to effectively integrate external HDLcode and LabVIEW graphical design on a single platformMoreover it allows a combination of high-level synthesis(HLS) and custom logic Since current HLS tools havelimitations when it comes to complex image and signalprocessing tasks LabVIEW FPGA tries to bridge these gapsby streamlining the design process

We used FlexRIO (Flexible Reconfigurable IO) FPGAboards plugged into a PXIe chassis PXIe FlexRIO FPGAboards are adaptable and can be used to achieve highthroughput because they allow direct data transfer betweenmultiple FPGAs at rates as high as 8GBs This can signifi-cantly simplify multi-FPGA systems which usually commu-nicate via a host PC This feature allows expansion of oursystem to further processing stages making it flexible for avariety of applications Figure 7 shows a basic overview ofa PXIe-based multi-FPGA system with a host PC controllerconnected through a high-speed bus on a PXIe chassis

Specifically we used two NI PXIe-7976R FlexRIO boardswhich have Kintex 7 FPGA and 2GB external DRAM withtheoretical data bandwidth up to 10GBs This FPGA boardwas plugged into a PXIe-1085 chassis along with a PXIe-8880Intel Xeon PC controller PXIe-1085 can hold up to 16 FPGAsand has 8 GBs per-slot dedicated bandwidth and an overallsystem bandwidth of 24 GBs

62 Experimental Setup As per Algorithms 1 and 2 dis-cussed in previous sections implementation involves fivestages (A) calculating the 2D FFT of an image frame (B)

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 11: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of Reconfigurable Computing 11

n

n

Row-by-Row (RR) Output

(a) Dataset view

k

p

q

Writing RR Output to DRAM

Row Buffer

(b) Memory view

k

Reading from DRAM

Row Buffer

(c) Memory view

Figure 6 Image showing tile-hopping (a) Image-level view showing tiles (b) DRAM-level view showing tile placement while writing (c)DRAM-level view showing column reading from the tiles

External IO PXIe

Host PC ControllerDisplay

Device

PXIeFPGA Board 1

High-throughput PXIe BUSPXIe Chassis

External IO

PXIeFPGA Board 2

External IO

PXIeFPGA Board 3

External IO

Figure 7 Block diagram of a PXIe-based multi-FPGA system witha host PC controller connected through a high-speed bus on a PXIechassis [8]

calculating the boundary image (C) calculating the 2DFFT of the boundary image (D) calculating the smoothcomponent and (E) subtracting the smooth component fromthe 2D FFT of the original image to achieve the periodiccomponent The bottleneck consistently occurs in A and theserial part of the algorithm (A 997888rarr B 997888rarr C) The limitation

due to A is reduced removing the so-called ldquomemory wallrdquoby using our proposed tile-hopping-based memory mappingThe limitations due to the serial part of the algorithm arereduced by using OPSD rather than PSD For quantificationthe delay for A is 062ms for a 512 times 512 image

The design flow presented in Figure 4 was followedData-flow is clearly shown in a graphical programmingenvironment making it easier to visualize how a designefficiently fits on an FPGA Highly efficient implementationsof 1D FFT were used from LabVIEW FPGA for parallelrow-by-row operations and by integrating Xilinx LogiCOREfor column-by-column operations Each stage of the designwas dynamically tested and benchmarked The image wasstreamed from the host PC using a Direct Memory Access(DMA) FIFO 1D FFTs are performed in parallel rows of 8and stored in the DRAM via local memory in a tiled patternas explained in the previous section This follows readingseveral rows to extract a single column which is Fourier-transformed using Xilinx LogiCORE and is sent back to

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 12: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

12 International Journal of Reconfigurable Computing

I(00)

I(n-1 m-1)

i

j

N x M size image

a b

a

1D FFT

c

Ori

gina

l Im

age

Boun

dary

Imag

e

TrigonometricFunctions

Equation (7)

FFT ofSmooth Component

FFT of Periodic Component(With Edge Artifact

Removed)

b

Shortcut for Column-wise 1D FFT

Exte

rnal

Mem

ory

Exte

rnal

Mem

ory

Row-wise 1D FFT

c

Bloc

k M

emor

y

Exte

rnal

Mem

ory

Row-wise 1D FFT

Host PC Controller(NI PXIe-8880)

FPGA-1(NI PXIe-7976R)

Host PC Controller (NI PXIe-8880)

PXIe Chassis (NI-PXIe1085)

FPGA-2(NI PXIe-7976R)Block Memory

Bloc

k M

emor

y

+

minusColumn-wise 1D FFTlowast

Row-Major Memory Mapping Tile-Hopping Memory Mapping

I(n-1 j) ndash I(0 j)

I(i

0) ndash

I(i

m-1

)

I(i

m-1

) ndash I(

i 0)

I(0 j) ndash I(n-1 j)

middot1

b1jv

-middot1 + (b11+b1m)v

Figure 8 Functional block diagramof PXIe-based 2DFFT implementationwith simultaneous edge artifact removal using optimized periodicplus smooth decomposition The OPSD algorithm is split among two NI-7976R (Kintex-7) FPGA boards with 2GB external memory and ahost PC connected over a high-bandwidth bus The image is streamed from the PC controller to FPGA 1 and FPGA 2 FPGA 1 calculates therow-by-row 1D FFT followed by column-by-column 1D FFT with intermediate tile-hopping memory mapping and sends the result back tothe host PC FPGA 2 receives the image calculates the boundary image and proceeds to calculate the 1D FFT column-by-column FFT usingthe shortcut presented in (16) followed by row-by-row 1D FFTs and the result is sent back to the host PC

Camera

HostPC

External Memory

FPGA Board 1

FPGA

DMAFIFO

FIFO IN(Local Memory - (Local Memory ndash

Write)

FIFO OUT

Read)

PXIe

BU

S

CU

1D FF4H

1D FF41

1D FF42

Figure 9 Block diagram of 2D FFT showing data transfer betweenexternal memory and local memory scheduled via a Control Unit(CU)

the host PC If the image is being streamed directly froman imaging device which scans and provides random ornonlinear sequence of rows it is necessary to store a frameof the image in a buffer This can also be accomplishedby streaming the image flow from the host PC or using asmart camera which can delay image delivery by a singleframe Local memory shown in Figure 9 is used to bufferdata between external memory and 1D FFT cores Thismemory is divided into read and write components and isimplemented using FPGA slices Block RAM (BRAM) is usedfor temporary storage of vectors required for calculating the2D FFT of the boundary image (in the case of FPGA 2)The Control Unit (CU) organizes scheduling for transferring

data between local and external memory CU is based onLabVIEWrsquos existing memory controller and our memorymapping scheme presented in Section 5

Step B was accomplished using standard LabVIEWFPGA HLS tools for programming (5) using the graphicalprogramming environment In step C the 2D FFT of theboundary image needs to be calculated by row and columndecomposition However as shown mathematically in theprevious section the initial row-wise FFTs can be calculatedby computing the 1D FFT of the first (boundary) vector andthe FFTs of reaming vectors can be computed by appropriatescaling of this vector

We need the boundary column vector for 1D FFT calcula-tion of the first and last columns We also need the boundaryrow vector for appropriate scaling of for the 1D FFT ofevery column between the first and last columns Row andcolumn vectors of the boundary image are stored in blockRAM (BRAM) Figure 8 shows a functional block diagramof the overall 2D FFT with optimized PSD process Steps Dand E are performed on the host PC to minimize memoryclashes and to access the periodic and smooth componentsof each frame as they become available 86 of resources areused on FPGA 1 and 41 resources are used on FPGA 2 Theresource utilization is reported according to LabVIEWFPGAsynthesis and compilation experiments It should be notedthat part of the reason for high resource utilization is becauseof using LabVIEW FPGA high-level synthesis tools as wellas Xilinx LogiCORE tools Using standardized tools makesit harder to optimize for resource utilization The currentimplementation was optimized for performance in terms ofrun time and energy consumption

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 13: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of Reconfigurable Computing 13

Fram

esS

econ

d (1

s)

105

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7)Kintex-7 (No DRAM Opt)Kintex-7 (DRAM Opt)Theoretical Peak

(a) 2D FFT performance DRAM tile-hopping

Fram

esS

econ

d (1

s)

104

103

102

101

100

2D FFT Size

128x x x x x x

128

256

256

512 1024 2048 4096

512 1024 2048 4096

CPU(i7) (PSD)Kintex-7 (PSD)Kintex-7 (OPSD)

(b) 2D FFT with EAR performance PSD versus OPSD

Figure 10 Performance evaluation in terms of frames per second for (a) 2D FFTs with tile-hopping memory pattern (b) 2D FFTs with edgeartifact removal (EAR) using OPSDThe performance evaluation shows the significance of the two optimizations proposed Both axes are ona log scale

63 Performance Evaluation The overall performance of thesystem was evaluated using the setup presented in Figure 9The data was streamed from the host PC in certain caseshigh frame rate videos as well as direct camera input werestreamed from the host All results presented are for 16-bitfixed-point precision Figure 10(a) presents the effectivenessof our proposed tile-hopping memory mapping scheme Itclearly shows the effectiveness of our proposed memorymapping since it is closer to the theoretical peak performanceFigure 10(b) presents the overall results comparing PSD andOPSD and demonstrating the effectiveness of our proposedoptimization PSD was also implemented on the same plat-form but the optimization presented in Section 4 was notused This rendered the serial portion of the algorithm to bethe bottleneck which reduced overall performance Table 2shows a comparison of our implementation in contrast torecent 2D FFT FPGA implementation approaches and showsthat we achieve a better performance even with simultaneousedge artifact removal Although our implementation is testedwith 16-bit fixed-point precision which limits the accuracy ofthe transform the precision may be sufficient for a varietyof speed critical applications where alternative edge artifactremoval methods (eg filtering) may decrease overall systemperformance Although the dynamic range of fixed-pointdata is smaller than floating-point data which can lead toerrors in the 2D FFT for certain applications it is moreimportant to have an artifact-free transform as compared toa highly accurate transform

64 Energy Evaluation The overall energy consumptionof the custom computing system depends on (1) powerperformance of the system components and (2) throughputdisparity Throughput disparity results in idle time for at

least one of the components and lowers the overall sys-tem throughput A throughput-optimized system minimizesinstances where certain components of the architecture areidle The proposed optimizations in Sections 4 and 5 clearlyreduce throughput disparity andminimize the idle time of thesystem Thus besides causing delays due to significant over-head standard column-wise DRAM access also contributesto the overall energy consumption This is not only due tohigh count of DRAM row charges but also because of energyconsumed by the FPGA in idle state Ideally maximizing theDRAMbandwidth limits the amount of energy consumptionProposed ldquotile-hoppingrdquo memory mapping scheme improvesthe DRAMbandwidth as seen in Figure 10 and hence reducesthe overall energy consumption The same is true for theproposed OPSD method where reduced DRAM access and1D FFT invocations lead to reduced energy consumption Inthis section we analyze the amount of improvement in energyconsumption based on the proposed optimizations

We estimate the DRAM power consumption for boththe baseline (standard strided) and the optimized (ldquotile-hoppingrdquo) memory access using MICRON DRAM powercalculator The energy is calculated in 119899119869 for each readie energy per read This is accomplished by calculatingthe run time for a specific 2D FFT and estimating theamount of energy consumed by the DRAM power calculatorTable 4 depicts the DRAM energy consumption for 2D FFTsbefore and after the proposed ldquotile-hoppingrdquo optimization forcolumn-wise DRAM access As mentioned earlier row-wiseDRAM access is fast and row buffer size data can be accessedby a single row activation According to Table 4 the energyrequired for DRAM access is reduced by 427 488 and529 for 1024 times 1024 2048 times 2048 and 4096 times 4096 size 2DFFTs respectively

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 14: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

14 International Journal of Reconfigurable Computing

Table 2 Comparison of OPSD1 2D FFT with regular RCD-based implementation

Platform SEAR2 Precision RT3 (ms)YesNo bits 512 times 512 1024 times 1024

Kintex 7 28nm (ours) Yes 16 (fixed) 15 48Kintex 7 28nm (ours) No 16 (fixed) 09 41Kintex 7 28mm [8] Yes 16 (fixed) 324 1167Stratix IV [5] No 64 (double) - 61Virtex-5-BEE3 65nm[14] No 32 (single) 249 1026Virtex-E 180nm [21] No 16 (fixed) 286 769ASIC 180nm No 32 (single) 210 -1 Optimized periodic + smooth decomposition (OPSD)2 Simultaneous edge artifact removal3 Runtime (ms)

Table 3 DRAM energy consumption baseline vs tile-hopping1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPRlowast CW∘ Read(Baseline) 446 577 712EPR CW ReadTile-Hopping 254 295 336(Proposed)Reduction () 427 488 529lowastEnergy per read (EPR)∘Column-wise memory access (CW)

Table 4 2D FFT + EAR energy consumption baseline vs optimized (OPSD + tile hopping)1024 times 1024 2048 times 2048 4096 times 4096119899119869119900119906119897119890 119899119869119900119906119897119890 119899119869119900119906119897119890EPPdagger 2D FFT + EARBaseline 3692 4125 48352D FFT+EAR (Opt)Tile-Hopping + OPSD 1588 1706 1811(Proposed)Improvement 23times 24times 28timesdaggerEnergy per point (EPP)

The metric used to compare the overall energy optimiza-tion achieved for 2D FFTs with EAR is energy per point iethe amount of average energy required to compute the 2DFFT of a single point in an image with simultaneous edgeartifact removal This was achieved by calculating the energyconsumed by Xilinx LogiCORE IP for 1D FFTs the DRAMand the edge artifact removal part separately The estimatedenergy calculated does not include energy consumed by thePXIe chassis and the host PC Essentially the FPGA-basedarchitecture presented here could be used without the hostcontroller The energy consumption incorporates dynamicas well as static power The overall energy consumption perpoint is reduced by 569 586 and 62 for calculating1024 times 1024 2048 times 2048 and 4096 times 4096 size 2D FFTs withEAR respectively

7 Application Filtered Back-Projectionfor Tomography

In order to further demonstrate the effectiveness of ourimplementation we use the created 2D FFT module asan accelerator for reducing the run time for filtered back-projection (FBP) FBP is a fundamental analytical tomo-graphic image reconstruction method In depth detailsregarding the basic FBP algorithm have been left out forbrevity but can be found in [28 29]Themethod can be usedto reconstruct primitive 3D tomograms from 2D data whichcan then be used as a basis for more complex regularization-basedmethods such as [30ndash32]The algorithmic flow is basedon the Fourier slice theorem ie 2D Fourier transformsof projections are an angular component of the 3D Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 15: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of Reconfigurable Computing 15

Table 5 Comparing filtered back-projection (FBP) runtime (as an application for using the proposed 2D FFT with simultaneous EAR)

3D Density CPU (i7) FPGA + Host PC (i7)Sec Sec128 times 128 times 128 213 sec 195 sec256 times 256 times 256 475 sec 424 sec512 times 512 times 512 948 sec 813 sec1024 times 1024 times 1024 3223 sec 2753 sec2048 times 2048 times 2048 16877 sec 13644 sec4096 times 4096 times 4096 164631 sec 125994 sec

Original Phantom CPU Reconstructed Phantom FPGA+CPU Reconstructed Ph

Inte

nsity

Inte

nsity

1

08

06

04

02

0

1

08

06

04

02

0

X Direction0 20 40 60 80 100 120

X Direction0 20 40 60 80 100 120

Original PhantomCPU BP

Original PhantomFPGA+CPU BP

Figure 11 Figure showing a thin slice of filtered back-projectionresults by reconstructing a 128 times 128 times 128 Shepp-Logan phantomThe 3Ddensitywas reconstructed from 180 equally spaced simulatedprojections using standard linearly interpolated FBP and using aRam-Lak filter It can be seen that the results from the FPGA + CPUsolution have some errors this is due to the fact that the 2D FFT ofeach projection is less accurate The results are good enough to beused as a basis for further optimization based refinement methods

transform of the 3D reconstructed volume Our 2D FFTaccelerator was used to calculate the 2D FFTs of the projec-tions as well as for initial stages of the 3D FFTwhich was thencompleted on the host PC Similar to the 2D FFT the 3D FFTis separable and can be divided into 2D FFTs and 1D FFTsThe results have been shown in Table 5 It can be seen thatthe improvement for smaller size densities is not significantbecause their FFTs are quite fast on general-purpose CPUsHowever for larger densities the FFT accelerator can give asignificant improvement If the remaining components arealso implemented on an FPGA significant speed increasecan be achieved Results of a thin slice from a 3D simulatedShepp-Logan [33] phantom have been shown in Figure 11 Itcan be seen that the results from the hardware acceleratedFBP are of slightly lower quality This is due to the factthat our 2D FFT implementation is less accurate (16 bitfixed-point) as compared to the CPU-based implementation(FFTW double-precision floating) The accelerated FBP wasalso tested with real Electron Tomography (ET) data

8 Conclusion

2D FFTs often become a major bottleneck for high-performance imaging and vision systems The inherentcomputational complexity of the 2D FFT kernel is fur-ther enhanced if effective removal (using PSD) of spuriousartifacts introduced by the nonperiodic nature of real-lifeimages is taken into accountWe developed and implementedan FPGA-based design for calculating high-throughput 2DDFTs with simultaneous edge artifact removal Our approachis based on a PSD algorithm that splits the frequency domainof a 2D image into a smooth component which contains thehigh-frequency cross-shaped artifacts and can be subtractedfrom the 2D DFT of the original image to obtain a periodiccomponent that is artifact-free Since this approach calculatestwo 2D DFTs simultaneously external memory addressingand repeated 1D FFT invocations become problematic Tosolve this problem we optimized the original PSD algorithmto reduce the number of DFT samples to be computed andDRAM access Moreover to reduce strided access from theDRAMduring column-wise reads we presented and analyzedldquotile-hoppingrdquo a memorymapping scheme which reduces thenumber of DRAM row activations when reading a singlecolumn of data This memory mapping scheme is generaland may be used for a variety of other applications Wedemonstrate that ldquotile-hoppingrdquomemorymapping can reducethe DRAM energy consumption by 529 Moreover weshow that the proposed optimizations lead to 28times less energyconsumption for the overall 2D FFT with EAR architecture

Our methods were tested using extensive synthesis andbenchmarking using a Xilinx Kintex 7 FPGA communicatingwith a host PC on a high-speed PXIe bus Our system isexpandable to support several FPGAs and can be adaptedto various large-scale computer vision and biomedical appli-cations Despite decomposing the image into periodic andsmooth frequency components our design requires lessrun time compared to traditional FPGA-based 2D DFTimplementation approaches and can be used for a varietyof highly demanding applications One such applicationfiltered back-projection was accelerated using the proposedimplementation to achieve better results specifically for largersize raw tomographic data

Conflicts of Interest

The authors declare that they have no conflicts of interest

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 16: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

16 International Journal of Reconfigurable Computing

Acknowledgments

Thisworkwas supported by JapaneseGovernmentOIST Sub-sidy for Operations (Ulf Skoglund) Grant no 5020S7010020FaisalMahmood andMart Toots were additionally supportedby the OIST PhD Fellowship The authors would like tothank National Instruments Research for their technicalsupport during the design process The authors would alsolike to thank Dr Steven D Aird for language assistance andShizuka Kuda for logistical arrangements

Supplementary Materials

File 2D FFT-EAR-Demomp4 video demonstrating theFPGA output of 2D FFT with edge artifact removal(Supplementary Materials)

References

[1] R N Bracewell The fourier transform and iis applications vol5 New York NY USA 1965

[2] Theory and application of digital signal processing vol 1Prentice-Hall Inc Englewood Cliffs NJ USA 1975

[3] R A Brooks and G Di Chiro ldquoTheory of image reconstructionin computed tomographyrdquo Radiology vol 117 no 3 I pp 561ndash572 1975

[4] L Moisan ldquoPeriodic plus smooth image decompositionrdquo Jour-nal of Mathematical Imaging and Vision vol 39 no 2 pp 161ndash179 2011

[5] B Akin P A Milder F Franchetti and J C Hoe ldquoMemorybandwidth efficient two-dimensional fast Fourier transformalgorithm and implementation for large problem sizesrdquo inProceedings of the 20th IEEE International Symposium on Field-Programmable Custom Computing Machines FCCM 2012 pp188ndash191 Canada May 2012

[6] J W Cooley and J W Tukey ldquoAn algorithm for the machinecalculation of complex Fourier seriesrdquoMathematics of Compu-tation vol 19 no 90 pp 297ndash301 1965

[7] H Kee N Petersen J Kornerup and S S BhattacharyyaldquoSystematic generation of FPGA-based FFT implementationsrdquoin Proceedings of the 2008 IEEE International Conference onAcoustics Speech and Signal Processing ICASSP pp 1413ndash1416USA April 2008

[8] F Mahmood M Toots L-G Ofverstedt and U Skoglundldquo2DDiscrete Fourier Transformwith simultaneous edge artifactremoval for real-Time applicationsrdquo in Proceedings of the Inter-national Conference on Field Programmable Technology FPT2015 pp 236ndash239 New Zealand December 2015

[9] M Frigo and S G Johnson ldquoFFTW an adaptive software archi-tecture for the FFTrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing vol 3 pp1381ndash1384 IEEE May 1998

[10] M Puschel J M F Moura J R Johnson et al ldquoSPIRAL codegeneration for DSP transformsrdquo Proceedings of the IEEE vol 93no 2 pp 232ndash273 2005

[11] EWang Q Zhang B Shen et alHigh-Performance Computingon the Intel Xeon Phi Springer International Publishing 2014

[12] B Akin F Franchetti and J C Hoe ldquoFFTS with near-optimalmemory access through block data layoutsrdquo in Proceedings ofthe 2014 IEEE International Conference onAcoustics Speech andSignal Processing ICASSP 2014 pp 3898ndash3902 Italy May 2014

[13] B Akin F Franchetti and J C Hoe ldquoUnderstanding thedesign space of DRAM-optimized hardware FFT acceleratorsrdquoin Proceedings of the 25th IEEE International Conference onApplication-Specific Systems Architectures and Processors ASAP2014 pp 248ndash255 Switzerland June 2014

[14] C-L Yu K Irick C Chakrabarti and V Narayanan ldquoMul-tidimensional DFT IP generator for FPGA platformsrdquo IEEETransactions on Circuits and Systems I Regular Papers vol 58no 4 pp 755ndash764 2011

[15] DHe andQ Sun ldquoApractical print-scan resilientwatermarkingschemerdquo in Proceedings of the IEEE International Conference onImage Processing 2005 ICIP 2005 pp 257ndash260 Italy September2005

[16] D G Bailey Design for embedded image processing on FPGAsJohn Wiley Sons 2011

[17] B G Batchelor ldquoImplementing Machine Vision Systems UsingFPGAsrdquo in Machine Vision Handbook 1136 p 1103 SpringerLondon UK 2012

[18] T Lenart M Gustafsson and V Owall ldquoA hardware accelera-tion platform for digital holographic imagingrdquo Journal of SignalProcessing Systems vol 52 no 3 pp 297ndash311 2008

[19] G H Loh ldquo3D-Stacked Memory Architectures for Multi-coreProcessorsrdquo ACM SIGARCH Computer Architecture News vol36 no 3 pp 453ndash464 2008

[20] Z Qiuling B Akin H E Sumbul et al ldquoA 3D-stacked logic-in-memory accelerator for application-specific data intensivecomputingrdquo in Proceedings of the IEEE International 3D SystemsIntegration Conference pp 1ndash7 October 2013

[21] I S Uzun A Amira and A Bouridane ldquoFPGA implementa-tions of fast Fourier transforms for real-time signal and imageprocessingrdquo pp 283ndash296

[22] T Dillon ldquoTwo Virtex-II FPGAs deliver fastest cheapest besthigh-performance image processing systemrdquo rdquo Xilinx XcellJournal vol 41 pp 70ndash73 2001

[23] A Hast ldquoRobust and invariant phase based local featurematchingrdquo in Proceedings of the 22nd International Conferenceon Pattern Recognition ICPR 2014 pp 809ndash814 Sweden August2014

[24] B Galerne Y Gousseau and J-M Morel ldquoRandom phasetextures theory and synthesisrdquo IEEE Transactions on ImageProcessing vol 20 no 1 pp 257ndash267 2011

[25] R Hovden Y Jiang H L Xin and L F Kourkoutis ldquoPeriodicArtifact Reduction in Fourier Transforms of Full Field AtomicResolution Imagesrdquo Microscopy and Microanalysis vol 21 no2 pp 436ndash441 2014

[26] D G Bailey ldquoThe advantages and limitations of high levelsynthesis for FPGA based image processingrdquo in Proceedings ofthe the 9th International Conference pp 134ndash139 Seville SpainSeptember 2015

[27] W Wang B Duan C Zhang P Zhang and N Sun ldquoAcceler-ating 2D FFT with non-power-of-two problem size on FPGArdquoin Proceedings of the 2010 International Conference on Recon-figurable Computing and FPGAs ReConFig 2010 pp 208ndash213Mexico December 2010

[28] F NattererTheMathematics of Computerized Tomography JohnWiley amp Sons 1986

[29] R A Crowther D J DeRosier and A Klug ldquoThe Reconstruc-tion of aThree-Dimensional Structure from Projections and itsApplication to Electron Microscopyrdquo Proceedings of the RoyalSociety A Mathematical Physical and Engineering Sciences vol317 no 1530 pp 319ndash340 1970

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 17: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of Reconfigurable Computing 17

[30] U Skoglund L-G Ofverstedt R M Burnett and G BricogneldquoMaximum-entropy three-dimensional reconstruction withdeconvolution of the contrast transfer function A test applica-tion with adenovirusrdquo Journal of Structural Biology vol 117 no3 pp 173ndash188 1996

[31] F Mahmood N Shahid P Vandergheynst and U SkoglundldquoGraph-based sinogram denoising for tomographic reconstruc-tionsrdquo in Proceedings of the 38th Annual International Confer-ence of the IEEE Engineering in Medicine and Biology SocietyEMBC 2016 pp 3961ndash3964 USA August 2016

[32] F Mahmood N Shahid U Skoglund and P VandergheynstldquoAdaptive Graph-Based Total Variation for TomographicReconstructionsrdquo IEEE Signal Processing Letters vol 25 no 5pp 700ndash704 2018

[33] L A Shepp and B F Logan Jr ldquoThe Fourier reconstruction of ahead sectionrdquo IEEE Transactions on Nuclear Science vol 21 no3 pp 21ndash43 1974

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 18: Algorithm and Architecture Optimization for 2D Discrete ...downloads.hindawi.com/journals/ijrc/2018/1403181.pdf · Algorithm and Architecture Optimization for 2D Discrete Fourier

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom