RAPID PROTOTYPING

Framework for FPGA-based discrete biorthogonal wavelet transforms implementation

I.S. Uzun and A. Amira

Abstract: The discrete wavelet transform has taken its place at the forefront of research for the development of signal and image processing applications. These wavelet-based approaches have outperformed existing strategies in many areas including telecommunication, numerical analysis and, most notably, image/video compression. The authors present an investigation into the design and implementation of 1-D and 2-D discrete biorthogonal wavelet transforms (DBWTs) using a field programmable gate array (FPGA)-based rapid prototyping environment. The proposed architectures for DBWTs are scalable, modular and have less area and time complexity when compared with existing structures. FPGA implementation results based on a Xilinx Virtex-2000E device have shown that the proposed system provides an efficient solution for the processing of DBWTs in real-time.

1 Introduction

In the last decade, discrete wavelet transforms (DWTs) have become powerful tools in a wide range of applications including image/video processing, numerical analysis and telecommunication. The advantage of the DWT over existing transforms, such as the discrete Fourier transform (DFT) and the discrete cosine transform (DCT), is that the DWT performs a multiresolution analysis of a signal with localisation in both time and frequency [1, 2]. The historical aspect of the development of wavelet transforms has been elucidated in two important texts by Daubechies [3] and Graps [4]. The present state of wavelet transform research is a result of the input received from many diverse areas of science.

In 1992, Cohen et al. [5] established the theory of biorthogonal wavelet systems. Biorthogonal wavelets have been found to offer improved coding gain and an efficient treatment of boundaries in image coding applications [6, 7]. A significant difference between the orthonormal and the biorthogonal wavelet transform lies in the quadrature mirror filter (QMF) relationship. The high-pass filter coefficients in the orthonormal filters are the QMF of the low-pass filter. Biorthogonal filters lack this property. Thus the architectural simplifications possible in orthonormal filters, such as the lattice structure [8, 9], are not possible in biorthogonal wavelets.

Because of the demand for real-time wavelet processing in applications such as video compression, Internet communications compression, object recognition and numerical analysis, many architectures for DWT have been proposed [8, 10–13]. Most of the effort towards the design and hardware implementation (in the form of very large scale integration (VLSI) and FPGA) of wavelet transforms has been concentrated on the orthonormal wavelet family. One of the main reasons for this is that orthonormal wavelets were the first functions to be implemented in the form of filter banks, whereas biorthogonal wavelet functions are relatively new. Secondly, the properties of biorthogonal wavelet functions are much more diverse, which complicates the development of their generic architecture. A good survey of the different schemes used for the development of DWT architectures can be found in a recent paper by Weeks and Bayoumi [14].

© The Institution of Engineering and Technology 2006

IEE Proceedings online no. 20045080

doi:10.1049/ip-vis:20045080

Paper first received 30th June 2004 and in revised form 6th March 2005

I.S. Uzun is with the School of Electronics, Electrical and Computer Science, Queen’s University Belfast, Belfast, UK

A. Amira is with the School of Engineering and Design, Brunel University, Uxbridge, Middlesex UB8 3PH, UK

E-mail: [email protected]

IEE Proc.-Vis. Image Signal Process., Vol. 153, No. 6, December 2006

Previously reported 1-D discrete biorthogonal wavelet transform (DBWT) works include the pipelined and pyramid algorithm (PA)-based VLSI architectures proposed in [15, 16]. The pipelined architecture has hardware complexity proportional to the number of decomposition levels ($J$), and it has a period of $N_0$ clock cycles to compute the DBWT of a sequence $x$ having $N_0$ samples. The PA-based design has been targeted to have a low hardware complexity, but it requires $O(JN_0)$ clock cycles for the computation. The authors also presented some FPGA implementations of the pipelined architecture. Since the designs were captured using a behavioural description, the FPGA implementations failed to provide efficient results. In [17], Nibouche and Nibouche presented FPGA implementations of bit-level and distributed arithmetic (DA)-based DBWT architectures in order to minimise area requirements, but their computation time is proportional to $N_0$, the input data wordlength ($W_i$) and the number of DBWT levels ($J$), that is, $O(N_0 W_i J)$. Recently, Jou et al. [18] proposed a VLSI implementation of a DBWT architecture (operating at 50 MHz) that can perform only the first-level decomposition in $N_0/2$ clock cycles; this design cannot be scaled to the higher levels of wavelet decomposition that most applications require.

Although there is a vast number of 2-D DWT architectures in the literature, only a limited number of 2-D DBWT architectures based on a recursive pyramid algorithm (RPA), or its modified versions, have been proposed [19–21]. When the 2-D wavelet basis functions are separable, the 2-D DWT can be split into row-wise and column-wise 1-D operations. Although this approach produces predictable solutions, these architectures do not incorporate many aspects of 2-D processing. The non-separable 2-D DWT does not perform separate row and column transforms, but instead computes the 2-D DWT directly, decomposing the input image in both dimensions at once. Only a few architectures [22, 23] and only one hardware implementation [24] have been proposed for the 2-D non-separable approach.

Systems known as custom computing machines (CCMs) use an FPGA to provide hardware for the efficient computation of the intensive, parallel portions of an algorithm while leaving the remaining code to be executed on the host processor. Several such CCMs have shown that this type of system can provide more than ten times better performance than standard microprocessors when addressing a specific problem [25–27]. As integration levels grow, the potential for providing parallelism with the programmable fabric will realise performance levels several orders of magnitude higher than those possible with microprocessors or digital signal processors (DSPs). Another key feature of FPGAs is their flexibility, which makes them attractive in many real system implementations [27, 28]. However, users must program FPGAs at a very low level and have a detailed knowledge of the architecture of the device being used. FPGAs therefore do not facilitate easy development of, or experimentation with, signal/image processing algorithms.

The main objectives of the research work presented in this paper can be described as follows:

• Developing efficient and scalable VLSI architectures for 1-D and 2-D biorthogonal wavelet transforms, where both area and speed can be estimated with specific design parameters.
• Developing a library of biorthogonal wavelet transforms targeting FPGAs, which can be extended for other types of wavelets.
• Developing a high-level framework to try to reconcile the dual requirements of high performance and ease of development by enabling the system designers to experiment conveniently with different wavelet filters to investigate the best area/speed trade-offs, rather than concentrating on the low-level and complex structure of FPGAs.

2 Proposed system

This section describes the proposed high-level framework environment for FPGA-based DBWT implementations.

2.1 Configurable computing

Reconfigurable hardware, usually in the form of FPGAs, has been touted as a new and better means of performing high performance computing. Reconfigurable computing systems are those computing platforms whose architecture can be modified by software to suit the application at hand. Reconfigurable computing involves manipulation of the logic within the FPGA at run-time. In other words, the design of the hardware may change in response to the demands placed upon the system while it is running. Reconfigurable computing has several advantages [27]:

• Possibility to achieve greater functionality with a simpler hardware design;
• Lower system cost, which does not manifest itself exactly as you might expect and
• Reduced time-to-market.

Typically, FPGA structures provide reconfigurable hardware with flexible interconnections and field-programmable ability, and are widely used for rapid prototyping of DSP and computer systems. Furthermore, the recent advances in IC processing technology and innovations in their architectures have made FPGAs highly suitable alternatives for designing powerful computing platforms [27].

2.2 Environment in detail

The proposed system for mapping the DBWTs onto the FPGA, shown in Fig. 1, consists of:

• Graphical user interface (GUI): The GUI supports experimentation with different parameters to enable the user to explore system performance, for example, speed and area. The input parameters required for the generation of design files include:
  – the DBWT dimension (1-D or 2-D);
  – the DBWT architecture type;
  – the DBWT filter length (e.g. 1-D 9-tap, 2-D 9/7 pair);
  – the transform length ($N$);
  – the input and output data wordlength ($W_i$ and $W_o$) and
  – the coefficient wordlength ($W_c$).
• DBWT library: The library includes the architectures for 1-D and 2-D DBWTs. The application has the ability to choose and download existing files and to generate new files and save them.
• Generator: The generator automatically downloads the necessary modules and then generates the top-level design files given the user-selected parameters and settings.
• FPGA coprocessor: Celoxica’s RC1000 FPGA-based development board is based on the Xilinx XCV2000E of the Virtex-E family. The external SRAM memories are connected to the FPGA in four 32-bit wide memory banks. The memory is also visible to the host CPU across the PCI bus. Each of the four banks may be granted to either the host CPU or the FPGA at any one time. The memory is then accessible to the FPGA directly, and to the host CPU either by DMA transfers across the PCI bus or simply as a virtual address.
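The GUI input parameters listed above can be pictured as a small record that is checked before any design files are generated. The following is only a sketch: the class name, the field names and the validation rules (for example, requiring an even transform length) are illustrative assumptions and are not taken from the actual tool. The architecture names are those shown in Fig. 1.

```python
from dataclasses import dataclass

# Hypothetical parameter record; names and checks are illustrative assumptions.
@dataclass(frozen=True)
class DbwtDesignParams:
    dimension: int          # 1 for 1-D, 2 for 2-D
    architecture: str       # e.g. "Arc1D-I", "Arc1D-II", "Arc2D-I", "Arc2D-II"
    filter_taps: tuple      # e.g. (9,) for 1-D 9-tap, (9, 7) for the 2-D 9/7 pair
    transform_length: int   # N
    w_in: int               # input data wordlength W_i (bits)
    w_out: int              # output data wordlength W_o (bits)
    w_coef: int             # coefficient wordlength W_c (bits)

    def validate(self):
        """Return a list of problems; an empty list means the set is consistent."""
        problems = []
        if self.dimension not in (1, 2):
            problems.append("dimension must be 1 or 2")
        if not self.architecture.startswith(f"Arc{self.dimension}D-"):
            problems.append("architecture does not match the dimension")
        if self.transform_length <= 0 or self.transform_length % 2:
            problems.append("transform length must be a positive even number")
        if min(self.w_in, self.w_out, self.w_coef) <= 0:
            problems.append("wordlengths must be positive")
        return problems

params = DbwtDesignParams(2, "Arc2D-II", (9, 7), 512, 8, 16, 12)
print(params.validate())   # an empty list: the parameter set is consistent
```

Keeping the parameters in one immutable record makes it straightforward to sweep, say, $W_c$ or the architecture type when exploring area/speed trade-offs.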

It is important to note here that although the target hardware in this work is an RC1000 board with a Xilinx XCV2000E Virtex FPGA, the architecture designs are completely portable and can be implemented on any type of FPGA chip with the use of the proposed system.

3 1-D DBWT architectures

3.1 1-D DBWT: mathematical review

A wavelet transform breaks a signal into shifted and scaled versions of a basis function called the ‘mother wavelet’. This is mathematically represented in (1). The direct implementation of this equation is computationally very intensive, as the time shift $\tau$ and the scaling factor $a$ can assume any real value

$$CWT_x(\tau, a) = \frac{1}{\sqrt{a}} \int x(t)\, h^{*}\!\left(\frac{t-\tau}{a}\right) dt \tag{1}$$
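To give a feeling for the cost of evaluating (1) directly, the sketch below approximates the integral by a midpoint Riemann sum for a handful of $(\tau, a)$ pairs. The mother wavelet (`mexican_hat`), the integration limits and the grid are assumptions chosen purely for illustration; they are not taken from the paper.

```python
import math

def mexican_hat(t):
    # Hypothetical real-valued mother wavelet, chosen only for illustration
    return (1.0 - t * t) * math.exp(-0.5 * t * t)

def cwt_point(x, tau, a, t_min=-10.0, t_max=10.0, steps=2000):
    """Approximate (1) at one (tau, a) by a midpoint Riemann sum."""
    dt = (t_max - t_min) / steps
    total = 0.0
    for s in range(steps):
        t = t_min + (s + 0.5) * dt
        total += x(t) * mexican_hat((t - tau) / a)   # wavelet is real, so h* = h
    return total * dt / math.sqrt(a)

# Every (tau, a) pair costs one full pass over the signal
grid = [(tau, a) for tau in (-1.0, 0.0, 1.0) for a in (0.5, 1.0, 2.0)]
coeffs = {p: cwt_point(mexican_hat, *p) for p in grid}
```

Even this small grid of nine coefficients requires nine passes over the signal, which is why the discrete filter-bank interpretation is preferred for hardware.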

Fig. 1 Rapid prototyping environment for discrete biorthogonal wavelet transforms
(Blocks shown: GUI and generator with the design parameters (transform dimension, wavelet type, transform size $N$, input/output data wordlengths $W_i$ and $W_o$, coefficient wordlength $W_c$, number of PEs $p$, FPGA device type); matrix transforms library containing the DBWT architectures Arc1D-I (pipelined), Arc1D-II (RPA based), Arc2D-I (separable) and Arc2D-II (non-separable); parametrisable Handel-C/VHDL/Verilog code; Celoxica DK3 suite; FPGA synthesis tools; EDIF files; FPGA P&R tool; FPGA configuration files; host-coprocessor system with host machine, I/O, Virtex-2000E (BG560 package) and SRAM banks.)

The work by Mallat [2] and Daubechies [1] led to the discrete filter-based interpretation of wavelets. Through this, wavelets can be implemented as a set of filter banks comprising a high-pass and a low-pass filter, each followed by downsampling by 2. The low-pass filtered and decimated output $a^j(n)$, having $N_j = N/2^j$ samples, is recursively passed through similar filter banks to add the dimension of varying resolution at every stage. This is mathematically expressed in (2) and schematically shown in Fig. 2

$$a^{j}(n) = \sum_{i=0}^{L-1} l(i) \cdot a^{j-1}(2n-i), \quad 0 \le n < N_j \tag{2a}$$

$$d^{j}(n) = \sum_{i=0}^{L-1} h(i) \cdot d^{j-1}(2n-i), \quad 0 \le n < N_j \tag{2b}$$

The coefficients $a^j(n)$ and $d^j(n)$ refer to the approximation and detail components in the signal at decomposition level $j$, respectively. The $l(i)$ and $h(i)$ represent the coefficients of the low-pass and high-pass $L$-tap filters, respectively. The following property is a direct consequence of the decimation by 2 in (2).

Property 1: Let $a^{j}_{even}(n)$, $l_{even}(n)$ and $h_{even}(n)$ be the even-numbered samples of $a^{j}(n)$ (at level $j$), $l$ and $h$, respectively. Also, let $a^{j}_{odd}(n)$, $l_{odd}(n)$ and $h_{odd}(n)$ be the odd-numbered samples of $a^{j}(n)$, $l$ and $h$, respectively. Then $a^{j+1}$ and $d^{j+1}$ can be defined as

$$a^{j+1}(n) = \sum_{i=0}^{\lceil L/2 \rceil - 1} l_{even}(i)\, a^{j}_{even}(n-i) + \sum_{i=0}^{L - \lceil L/2 \rceil - 1} l_{odd}(i)\, a^{j}_{odd}(n-i) \tag{3a}$$

$$d^{j+1}(n) = \sum_{i=0}^{\lceil L/2 \rceil - 1} h_{even}(i)\, d^{j}_{even}(n-i) + \sum_{i=0}^{L - \lceil L/2 \rceil - 1} h_{odd}(i)\, d^{j}_{odd}(n-i) \tag{3b}$$

where $1 \le j \le J$.

Property 1 means that the decimated wavelet transform coefficients can be directly computed by a point-wise sum of convolutions on even-numbered and odd-numbered samples of input data with even-numbered and odd-numbered filter coefficients, respectively.
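Property 1 can be checked numerically. The sketch below computes one approximation level first by the decimated convolution of (2a) and then as the sum of an even-phase and an odd-phase convolution as in (3a). Two conventions are assumptions made here for illustration: samples outside the signal are taken as zero, and the odd-numbered samples are indexed as $a_{odd}(m) = a(2m-1)$, which is the indexing under which (3a) matches (2a) term by term. The filter values are arbitrary integers, not real wavelet coefficients.

```python
def analysis_direct(a_prev, l):
    """(2a): a(n) = sum_i l(i) * a_prev(2n - i), zero outside the signal."""
    def sample(idx):
        return a_prev[idx] if 0 <= idx < len(a_prev) else 0
    return [sum(l[i] * sample(2 * n - i) for i in range(len(l)))
            for n in range(len(a_prev) // 2)]

def analysis_polyphase(a_prev, l):
    """(3a): sum of an even-phase and an odd-phase convolution."""
    def sample(idx):
        return a_prev[idx] if 0 <= idx < len(a_prev) else 0
    l_even, l_odd = l[0::2], l[1::2]       # ceil(L/2) and L - ceil(L/2) taps
    a_even = lambda m: sample(2 * m)       # even-numbered samples
    a_odd = lambda m: sample(2 * m - 1)    # odd-numbered samples (assumed indexing)
    out = []
    for n in range(len(a_prev) // 2):
        even_part = sum(l_even[i] * a_even(n - i) for i in range(len(l_even)))
        odd_part = sum(l_odd[i] * a_odd(n - i) for i in range(len(l_odd)))
        out.append(even_part + odd_part)
    return out

x = list(range(1, 17))                     # 16 input samples
l = [1, -2, 3, 4, 5, 4, 3, -2, 1]          # arbitrary 9-tap integer filter
print(analysis_direct(x, l) == analysis_polyphase(x, l))   # True
```

Because the two phases are independent, they can be evaluated by two separate filters working in parallel, which is what the first processing element described later exploits.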

Fig. 2 Three-level wavelet decomposition system
(At each of the three levels the input is passed through a low-pass and a high-pass filter, each followed by downsampling by 2; the outputs are Detail 1 ($d^1$), Detail 2 ($d^2$), Detail 3 ($d^3$) and the approximate signal ($a^3$).)
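The recursive structure of Fig. 2 can be sketched in a few lines. At every level both filters are applied to the previous approximation and every second output is kept; the approximation is then fed to the next level. Zero padding at the signal boundary and the 2-tap placeholder filters are assumptions made only to keep the example short; they are not the biorthogonal filters used in this work.

```python
def analysis_level(signal, l, h):
    """One filter-bank stage: filter with l and h, keep every second output."""
    def sample(idx):
        return signal[idx] if 0 <= idx < len(signal) else 0   # zero padding (assumption)
    half = len(signal) // 2
    approx = [sum(l[i] * sample(2 * n - i) for i in range(len(l))) for n in range(half)]
    detail = [sum(h[i] * sample(2 * n - i) for i in range(len(h))) for n in range(half)]
    return approx, detail

def dwt_pyramid(x, l, h, levels):
    """Return ([d1, d2, ..., dJ], aJ) as in the three-level system of Fig. 2."""
    details, approx = [], list(x)
    for _ in range(levels):
        approx, detail = analysis_level(approx, l, h)
        details.append(detail)
    return details, approx

# Placeholder 2-tap filters (unnormalised, Haar-like), not the paper's coefficients
details, a3 = dwt_pyramid(list(range(16)), l=[1, 1], h=[1, -1], levels=3)
print([len(d) for d in details], len(a3))   # [8, 4, 2] 2
```

The halving of the sequence length at each level is the source of the workload imbalance between levels that the architectures of Section 3 are designed around.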

3.2 Derivation of 1-D DBWT

In this paper, biorthogonal wavelet filters have been considered. These filters are very attractive for implementing pyramidal structures since they do not require phase compensation between decomposition levels. Biorthogonal wavelet filters possess a linear-phase property and they have a symmetric (or anti-symmetric) impulse response. Their filter coefficients can thus be written as follows

$$l(n) = \pm\, l(L-1-n), \quad n = 0, 1, \ldots, \left\lceil \frac{L}{2} \right\rceil - 1 \tag{4}$$

where $L$ is the filter length and $\lceil \cdot \rceil$ denotes rounding up to the nearest integer.

For the sake of illustration, the case of a 9-tap 1-D biorthogonal filter ($L = 9$) will be considered in the rest of the paper. For a 9-tap biorthogonal filter, the symmetry of the filter coefficients can be written as

$$l(4-n) = \pm\, l(4+n), \quad n = 0, 1, \ldots, 4 \tag{5}$$

By taking into account the symmetry given in (4), a biorthogonalised version of the DWT given in (2a) can be described as follows

$$a^{j}(n) = l(4) \cdot a^{j-1}(2n-4) + \sum_{i=0}^{3} l(i) \left\{ a^{j-1}(2n-8+i) + a^{j-1}(2n-i) \right\} \tag{6}$$

where the case of a symmetric filter of length $L = 9$ is considered.
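The saving offered by the coefficient symmetry can be illustrated as follows. For a symmetric 9-tap filter, tap $i$ and tap $8-i$ carry the same coefficient, so the two corresponding inputs can be added first and multiplied once; together with the centre tap this needs five multiplications per output instead of nine. The sketch below compares this folded form with the direct form of (2a). Zero padding and the integer test coefficients are assumptions for illustration only.

```python
def fir9_direct(a_prev, l, n):
    """(2a) with L = 9: nine multiplications per output sample."""
    s = lambda idx: a_prev[idx] if 0 <= idx < len(a_prev) else 0
    return sum(l[i] * s(2 * n - i) for i in range(9))

def fir9_folded(a_prev, l, n):
    """Symmetric form: pre-add the mirrored inputs, then five multiplications."""
    s = lambda idx: a_prev[idx] if 0 <= idx < len(a_prev) else 0
    acc = l[4] * s(2 * n - 4)                        # centre tap
    for i in range(4):                               # relies on l(i) == l(8 - i)
        acc += l[i] * (s(2 * n - 8 + i) + s(2 * n - i))
    return acc

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
l = [1, -2, 3, 4, 5, 4, 3, -2, 1]                    # arbitrary symmetric 9-tap filter
print(all(fir9_direct(x, l, n) == fir9_folded(x, l, n) for n in range(8)))   # True
```

In hardware the pre-addition costs one adder per coefficient pair but removes a multiplier, which is usually the more expensive resource on an FPGA.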

3.3 Arc1D-I: a balanced pipelined architecture

The 1-D DBWT can be pipelined into $J$ processing elements $PE_j$ ($1 \le j \le J$), where each $PE_j$ is responsible for the computation of decomposition level $j$. The complexity of decomposition level $j$ is linear in the number of input samples $N_j$. Because of the decimation by 2, the complexity $C_j$ of each decomposition level can be expressed as

$$C_j = 2 \cdot C_{j+1} \tag{7}$$

Therefore, in order to constitute a balanced pipelined 1-D DBWT architecture, each $PE_j$ should consist of $M_j$ multipliers, where $M_j = 2 \cdot M_{j+1}$. Since $PE_1$ uses $M_1 = \lceil L/2 \rceil$ multipliers, which leads to a $PE_1$ design having a period of $N_0/2$ clock cycles with 100% efficiency, each $PE_j$ should have

$$M_j = \left\lceil \frac{L}{2^j} \right\rceil, \quad j = 1, 2, \ldots, J \tag{8}$$

multipliers. A top-level architecture of the pipelined 1-D DBWT is shown in Fig. 3.

Fig. 3 Arc1D-I: top-level architecture for 1-D DBWT
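The multiplier allocation of (8) is easy to tabulate. The sketch below evaluates $M_j = \lceil L/2^j \rceil$ and, as an additional illustrative quantity not defined in the paper, the number of multiplier slots each processing element has available per output sample, assuming that $PE_j$ produces one output every $2^{j-1}$ clock cycles.

```python
def multipliers_per_pe(L, J):
    """M_j = ceil(L / 2**j) for j = 1..J, as in (8), using integer arithmetic."""
    return [(L + 2 ** j - 1) // 2 ** j for j in range(1, J + 1)]

def slots_per_output(L, J):
    """Multiplier slots per output sample if PE_j emits one sample every 2**(j-1) cycles."""
    return [m * 2 ** (j - 1) for j, m in enumerate(multipliers_per_pe(L, J), start=1)]

print(multipliers_per_pe(9, 3))   # [5, 3, 2]
print(slots_per_output(9, 3))     # [5, 6, 8]
```

For the 9-tap symmetric filter there are five distinct coefficient products per output, so each of the three processing elements has enough slots (5, 6 and 8) to complete its work within its own output period.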


3.3.1 First level of decomposition, PE1: In order to design a high-speed PE to perform the first level of DBWT decomposition, Property 1 is used. According to Property 1, the first-level decomposition ($a^1$) is computed as the point-by-point sum of even and odd convolutions.

In order to process the even part ($x_e$) and odd part ($x_o$) of the input sequence in parallel, two individual filters ($L_{even}$ and $L_{odd}$) are designed so that the first-level decomposition ($a^1$) is performed at the rate of two input samples per clock cycle.

Equation (5) has been exploited in the design of the biorthogonal filter units to reduce the VLSI area in terms of the number of multipliers and adders. Taking advantage of the coefficient symmetry, relevant inputs can be added and connected to the multipliers before the actual multiplication is performed. The number of multipliers is thereby reduced from $L$ to $L/2$ for even $L$ and to $(L+1)/2$ for odd $L$. For illustration, the architecture of PE1 is shown in Fig. 4a for a 9-tap biorthogonal filter. This type of architecture is suitable for wavelet filter pairs such as the biorthogonal 9/3 or 9/7 tap systems. The functional analysis diagram for $L_{even}$ and $L_{odd}$ is also provided in Fig. 4b.

PE1 has two parallel input lines in order to feed even-numbered samples ($x_{even}$) into $L_{even}$ and odd-numbered samples ($x_{odd}$) into $L_{odd}$ during the same clock cycle ($t = n$), each at the rate of one sample per clock cycle. Therefore the first-level decomposition of an $N_0$-sample input sequence $x$ can be performed in only $N_0/2$ clock cycles. The outputs $a^1$ are generated by point-by-point addition of the partial outputs from the $L_{even}$ and $L_{odd}$ filters at the rate of one sample per clock cycle.

3.3.2 Second level of decomposition, PE2: The second level of decomposition requires a second processing element, PE2, which has to be pipelined to the output of PE1. As explained previously, PE2 must have three multipliers in order to constitute a balanced pipeline with PE1. Each multiplier performs the computations related to two coefficients of the filter.

A generic architecture has been designed by combining the symmetry property of biorthogonal filters with polyphase decomposition. The architecture for a 9-tap biorthogonal filter is depicted in Fig. 5a.

As shown in the functional analysis diagram (Fig. 5b), the input to PE2 is a sequential data stream from PE1 at a rate of one sample per clock cycle. The computations are periodic with a period of 2 clock cycles. Each period is divided into two subperiods identified by a 1-bit select signal $S$. In the subperiod $S = 0$, the multiplier $M_k$ multiplies the input sample by the filter coefficient $h_{2k}$ and adds the product to the data stored in the buffer during the previous subperiod. In the subperiod $S = 1$, $M_k$ multiplies the input sample by the filter coefficient $h_{2k+1}$ and stores the resulting sum in the buffer. In this way, $a^2(n)$ is produced on the even-numbered clock cycles.
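The two-subperiod schedule described above can be imitated with a small functional model. It is only a sketch of the arithmetic, not of the actual datapath of Fig. 5a: the buffer is modelled as a dictionary of partial sums, the filter is assumed symmetric with odd length, and each product is reused for the two mirrored taps that share a coefficient. The function and variable names are hypothetical.

```python
def pe2_model(stream, h):
    """Functional model of a folded, symmetric, decimating filter (odd L, symmetric h)."""
    L = len(h)
    unique = range((L + 1) // 2)                     # h(c) == h(L - 1 - c)
    acc = {}                                         # output index -> partial sum ("buffer")
    busiest = 0
    for t, x in enumerate(stream):
        S = t % 2                                    # 1-bit select signal
        coeffs = [c for c in unique if c % 2 == S]   # h_2k when S = 0, h_2k+1 when S = 1
        busiest = max(busiest, len(coeffs))          # multipliers needed in this cycle
        for c in coeffs:
            product = h[c] * x                       # one multiplication, shared by mirrored taps
            for tap in {c, L - 1 - c}:
                n = (t + tap) // 2                   # output whose tap `tap` reads sample t
                acc[n] = acc.get(n, 0) + product
    return [acc.get(n, 0) for n in range(len(stream) // 2)], busiest

out, busiest = pe2_model([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8], [1, -2, 3, 4, 5, 4, 3, -2, 1])
print(busiest)   # 3
```

For the 9-tap case the model never needs more than three multiplications in any clock cycle, in line with the three multipliers of PE2, and each output is completed when its even-numbered input sample arrives.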

3.3.3 Higher levels of decomposition, PE3: The third processing element PE3 is pipelined to the output of PE2. It has two multipliers and each multiplier performs the computations related to three coefficients of the filter. This kind of folded computation is made possible since the output from PE2, $a^2(n)$, is at the rate of one sample every 2 clock cycles. Therefore the input data can be replicated without increasing the period in order to allow each multiplier to perform the necessary number of operations before new data are input from PE2.

Fig. 6b shows the functional analysis diagram of PE3 for a 9-tap biorthogonal filter. The computation is periodic with a period of 4 clock cycles. Each period is divided into four subperiods controlled by a 2-bit select signal $S$. In the subperiods $S = 1, 2, 3$, $M_k$ multiplies the input sample by the filter coefficients $h_1$, $h_2$, $h_3$ and adds these products to the data stored in the buffer in each subperiod. In the subperiod $S = 0$, the multiplier $M_k$ multiplies the input sample by the filter coefficients $h_0$ and $h_4$, adds these products to the data stored in the buffer and then outputs the result. In this way, $a^3(n)$ is produced at the rate of one sample every 4 clock cycles.

Fig. 4 Arc1D-I
a PE1 architecture for 9-tap low-pass analysis filter
b Functional analysis diagram for PE1
Arrows indicate additions

Fig. 5 Arc1D-I
a PE2 architecture for 9-tap low-pass analysis filter
b Functional analysis diagram for PE2

Fig. 6 Arc1D-I
a PE3 architecture for 9-tap low-pass analysis filter
b Functional analysis diagram for PE3

3.4 Arc1D-II: a hybrid-pipeline architecture

As described in the previous section, the 1-D DBWT can be pipelined into $J$ processing elements $PE_j$ ($1 \le j \le J$), each $PE_j$ being devoted to computing decomposition level $j$. Nevertheless, the downsampling occurring in each decomposition level makes fully pipelined architectures heavily underutilised, since the stage implementing decomposition level $j$ is usually clocked at a frequency $2^{j-1}$ times lower than the clock frequency used in the first level.

In order to avoid this underutilisation, we propose a hybrid-pipeline architecture for the 1-D DBWT consisting of two PEs [29]. PE1 is devoted to performing the first level of decomposition ($j = 1$), whereas PE2 is responsible for the higher levels of decomposition ($2 \le j \le J$) based on a modified-RPA approach. A top-level scheme of the architecture is given in Fig. 7.

The Arc1D-II architecture uses the same PE1 for processing the first level of decomposition as described in detail in Section 3.3.1.

3.4.1 Higher level of decomposition, PE2: When more than one level of decomposition is needed, the subband $a^1$ produced by PE1 has to be further transformed. This requires a second PE, which has to be pipelined to the output of PE1.

Since the output from PE1 is a sequential data stream of $N_1 = N_0/2$ samples at the rate of one sample per clock cycle, an RPA-based architecture that performs a $(J-1)$-level DBWT of an $N_1$-sample input in $N_1$ clock cycles can be used for the higher levels of decomposition ($2 \le j \le J$).

The architecture of PE2 implements the RPA algorithm. The main components of this architecture are a 1-D DBWT core (including low-pass and high-pass filters) and a storage unit of size $L$ for each decomposition level $j$ ($2 \le j \le J$). The second level of decomposition is computed every other cycle, and all higher levels are computed between two consecutive second-level computations. Fig. 8 gives the block diagram of the architecture for PE2.
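The interleaving of levels described above can be sketched as a schedule. The version below uses one common RPA ordering, in which the level served in clock cycle $t$ is determined by the number of trailing zero bits of $t$; this ordering is an assumption made for illustration, and the exact timing of the implemented PE2 may differ. The function names are hypothetical.

```python
def rpa_level(t, J):
    """Level served by PE2 in clock cycle t (t = 1, 2, ...), or None if idle."""
    r = 1
    while t % 2 == 0:          # count the trailing zero bits of t
        t //= 2
        r += 1
    level = r + 1              # PE2 handles levels 2..J
    return level if level <= J else None

def pe2_utilisation(N1, J):
    """Fraction of the N1 cycles in which PE2 computes an output."""
    busy = sum(1 for t in range(1, N1 + 1) if rpa_level(t, J) is not None)
    return busy / N1

print([rpa_level(t, 4) for t in range(1, 9)])   # [2, 3, 2, 4, 2, 3, 2, None]
print(pe2_utilisation(256, 4))                  # 0.875
```

Level 2 occupies every other cycle and the higher levels fill the gaps, so all $J-1$ levels fit into the $N_1$ cycles, and the fraction of busy cycles approaches 1 as $J$ grows.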

Fig. 7 Arc1D-II: top-level architecture for 1-D DBWT


The architecture of the 1-D DBWT core used in PE2 has been designed based on the architecture proposed in [6], by combining linear phase with polyphase decomposition properties. It requires $\lceil L/4 \rceil$ multipliers and adders as shown in Fig. 5a. Therefore it provides up to four times improvement in hardware cost over a direct-form structure. The functional analysis diagram for the DBWT core is provided in Fig. 5b. The resulting PE2 architecture performs a $(J-1)$-level DBWT of an $N_1$-sample sequence in $N_1$ clock cycles.

4 2-D DBWT architectures

4.1 2-D DBWT: mathematical review

The 2-D DWT is a multilevel decomposition technique which provides an efficient method for analysing signals at different frequency bands. Each decomposition level $j$ can be seen as the further decomposition of a 2-D data set $I^{j-1}$ (having $N_{j-1} \times N_{j-1}$ samples) into four subbands $LL^j$, $LH^j$, $HL^j$ and $HH^j$ (each having $N_{j-1}/2 \times N_{j-1}/2$ samples).

In the separable approach, such a decomposition can be implemented as a set of row-wise and column-wise filter banks comprising a high-pass (H) and a low-pass (L) filter, each followed by downsampling by two, as depicted in Fig. 9a. The filtered and decimated approximation output LL is recursively passed through similar filter banks to add the dimension of varying resolution at every stage.

In the non-separable (or direct) approach, the decomposition is computed by four 2-D convolutions followed by a decimation by 2 in both horizontal and vertical directions as shown in Fig. 9b.

These decimated 2-D convolutions can be defined as follows

$$LL^{j}(m,n) = \sum_{i=0}^{L-1} \sum_{k=0}^{M-1} ll(i,k) \cdot LL^{j-1}(2m-i,\, 2n-k) \tag{9}$$

$$LH^{j}(m,n) = \sum_{i=0}^{L-1} \sum_{k=0}^{M-1} lh(i,k) \cdot LH^{j-1}(2m-i,\, 2n-k) \tag{10}$$

$$HL^{j}(m,n) = \sum_{i=0}^{L-1} \sum_{k=0}^{M-1} hl(i,k) \cdot HL^{j-1}(2m-i,\, 2n-k) \tag{11}$$

$$HH^{j}(m,n) = \sum_{i=0}^{L-1} \sum_{k=0}^{M-1} hh(i,k) \cdot HH^{j-1}(2m-i,\, 2n-k) \tag{12}$$

where $ll(i,k)$, $lh(i,k)$, $hl(i,k)$ and $hh(i,k)$ are the coefficients of the low–low, low–high, high–low and high–high ($L \times M$)-tap 2-D filter bases, respectively. $LL^{j}(m,n)$, $LH^{j}(m,n)$, $HL^{j}(m,n)$ and $HH^{j}(m,n)$ are the low–low, low–high, high–low and high–high subbands produced at decomposition level $j$. As a special case, $LL^{j-1}$ for $j = 1$ represents the input image $I$.

Fig. 8 Arc1D-II
a PE2 architecture for 9-tap low-pass analysis filter
b Functional analysis diagram for PE2

Fig. 9 Block diagram
a Separable 2-D DWT
b Non-separable 2-D DWT for two levels of decomposition
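The relationship between the two approaches can be illustrated for the low–low subband. The sketch below evaluates the decimated 2-D convolution of (9) directly and compares it with row-wise followed by column-wise 1-D filtering. It assumes zero padding outside the image and a separable kernel, $ll(i,k) = c(i)\,r(k)$, built from two hypothetical short integer filters; none of these choices are taken from the paper.

```python
def dwt2_direct(image, ll):
    """(9): LL(m, n) = sum_i sum_k ll(i, k) * I(2m - i, 2n - k), zero outside the image."""
    rows, cols = len(image), len(image[0])
    px = lambda r, c: image[r][c] if 0 <= r < rows and 0 <= c < cols else 0
    return [[sum(ll[i][k] * px(2 * m - i, 2 * n - k)
                 for i in range(len(ll)) for k in range(len(ll[0])))
             for n in range(cols // 2)] for m in range(rows // 2)]

def dwt2_separable(image, col_filter, row_filter):
    """Row-wise then column-wise 1-D filtering, each followed by decimation by 2."""
    def filt(seq, f):
        s = lambda idx: seq[idx] if 0 <= idx < len(seq) else 0
        return [sum(f[i] * s(2 * n - i) for i in range(len(f))) for n in range(len(seq) // 2)]
    row_done = [filt(row, row_filter) for row in image]               # along each row
    cols = [filt([r[c] for r in row_done], col_filter)                # along each column
            for c in range(len(row_done[0]))]
    return [list(r) for r in zip(*cols)]                              # back to row-major order

image = [[(3 * r + 5 * c) % 11 for c in range(8)] for r in range(8)]
cf, rf = [1, 2, 1], [1, -1]                                           # hypothetical 1-D filters
ll = [[a * b for b in rf] for a in cf]                                # separable 2-D kernel
print(dwt2_direct(image, ll) == dwt2_separable(image, cf, rf))        # True
```

The direct form needs no intermediate transposition of the data, at the cost of $L \times M$ coefficient products per output before any symmetry is exploited.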

The 2-D convolution by an ($L \times M$)-tap filter can be seen as the sum of $L$ 1-D convolutions performed by $M$-tap filters on the rows of the input data set, where the 1-D filters are the rows of the 2-D filter. A block diagram of an ($L \times M$)-tap 2-D filter, which implements the 2-D convolution of a row-wise $N \times N$ input data set, is shown in Fig. 10a. It consists of a pipe of $(L-1)$ row-delay circuits, where each row-delay circuit has $N$ delay elements (ND), and $L$ 1-D filters $P_i$ ($i = 0, 1, \ldots, L-1$) operating in parallel on $L$ consecutive rows of input data. The 1-D filters ($P_i$) are used to compute the inner summations in (9)–(12), which represent decimated 1-D convolutions. The row-adder block computes the outer sums.

As a direct consequence of the decimation by 2 along the ‘vertical’ direction (or along the rows), the even-numbered and odd-numbered rows of the input data set can be simultaneously fed and processed in parallel by the use of a split row-delay circuit as shown in Fig. 10b. In this way, decimation by 2 in the vertical direction is directly performed.

4.2 Derivation of 2-D DBWT

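Because of the symmetry along rows and columns of (L × M)-tap biorthogonal filters, their filter coefficients (ll(i, k), lh(i, k), hl(i, k) and hh(i, k)) can be written as follows

$$\begin{aligned}
ll(\lceil L/2\rceil - i,\ \lceil M/2\rceil - k) &= \pm\, ll(\lceil L/2\rceil - i,\ \lceil M/2\rceil + k)\\
&= ll(\lceil L/2\rceil + i,\ \lceil M/2\rceil - k)\\
&= \pm\, ll(\lceil L/2\rceil + i,\ \lceil M/2\rceil + k)
\end{aligned} \tag{13}$$

where $\lceil\cdot\rceil$ rounds up to the nearest integer (for example, $\lceil 9/2\rceil = 5$), i = 0, 1, ..., $\lceil L/2\rceil - 1$ and k = 0, 1, ..., $\lceil M/2\rceil - 1$.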

For the sake of illustration, the case of a (9 × 7)-tap biorthogonal filter (L = 9 and M = 7) will be considered in the rest of the paper. For a (9 × 7)-tap filter, the symmetry along the rows of the filter (vertical symmetry) can be written as

$$ll(4-i,\,k) = ll(4+i,\,k) \tag{14}$$

which means that the (4 − i)th and (4 + i)th 1-D filters in Fig. 10a (where i = 0, 1, ..., 4) use the same filter coefficients. Therefore only one 1-D filter can be used, instead of two, in order to process the point-wise sum of rows $[I]_{(2m-i)}$ and $[I]_{(2m+i)}$. As a result, the number of 1-D filters can be reduced from L to $\lceil L/2\rceil$, as shown in Fig. 10c.
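The reduction from nine to five 1-D filters can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the kernel values are random, `rows[i]` stands for the input row that meets kernel row i, and samples before the start of a row are taken as zero.

```python
import random

def decimated_fir(row, taps):
    """y(n) = sum_k taps[k] * row[2n - k]; samples before the row start count as zero."""
    return [sum(t * row[2 * n - k] for k, t in enumerate(taps) if 2 * n - k >= 0)
            for n in range(len(row) // 2)]

def add_rows(a, b):
    """Point-wise sum of two rows."""
    return [x + y for x, y in zip(a, b)]

random.seed(1)
M, N = 7, 12
half = [[random.uniform(-1, 1) for _ in range(M)] for _ in range(5)]   # kernel rows 0..4
kernel = half + half[3::-1]            # rows 5..8 mirror rows 3..0, so kernel[4 - i] == kernel[4 + i]
rows = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(9)]   # nine consecutive input rows

# Nine 1-D filters, one per kernel row, followed by the row adder.
nine = [0.0] * (N // 2)
for i in range(9):
    nine = add_rows(nine, decimated_fir(rows[i], kernel[i]))

# Five 1-D filters: rows sharing the same coefficients are added before filtering.
five = decimated_fir(rows[4], kernel[4])                 # centre row {4}
for i in range(4):                                       # pairs {0,8}, {1,7}, {2,6}, {3,5}
    five = add_rows(five, decimated_fir(add_rows(rows[i], rows[8 - i]), kernel[i]))

max_err = max(abs(a - b) for a, b in zip(nine, five))
print("largest difference:", max_err)
```

The folded form trades four 1-D filters for four row adders, which is the saving the text attributes to the vertical symmetry.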

Because of the vertical symmetry about the fifth row, (9)–(12) for the first 1-D filter ($P_0$) can be rewritten as

$$LL_{j\{0,8\}}(m,n) = \sum_{i=0}^{8}\sum_{k=0}^{6} ll(0,k)\cdot\big(LL_{j-1}(2m-8,\,2n-k) + LL_{j-1}(2m,\,2n-k)\big) \tag{15}$$

$$LH_{j\{0,8\}}(m,n) = \sum_{i=0}^{8}\sum_{k=0}^{6} lh(0,k)\cdot\big(LH_{j-1}(2m-8,\,2n-k) + LH_{j-1}(2m,\,2n-k)\big) \tag{16}$$

$$HL_{j\{0,8\}}(m,n) = \sum_{i=0}^{8}\sum_{k=0}^{6} hl(0,k)\cdot\big(HL_{j-1}(2m-8,\,2n-k) + HL_{j-1}(2m,\,2n-k)\big) \tag{17}$$

$$HH_{j\{0,8\}}(m,n) = \sum_{i=0}^{8}\sum_{k=0}^{6} hh(0,k)\cdot\big(HH_{j-1}(2m-8,\,2n-k) + HH_{j-1}(2m,\,2n-k)\big) \tag{18}$$

where the subscript {0,8} corresponds to the point-wise sum of the first and the ninth rows. Similarly, the subscripts {1,7}, {2,6}, {3,5} and {4} are used for $P_1$, $P_2$, $P_3$ and $P_4$, as shown in Fig. 10c.

Fig. 10 Block diagram
a (L × M)-tap 2-D filter by means of L = 9 (1-D) convolutions ($P_i$). Each $P_i$ is an M-tap 1-D filter
b (L × M)-tap 2-D filter with two individual data pipes for even-numbered and odd-numbered rows, leading to the direct computation of decimated output
c Biorthogonalised version of the (L × M)-tap 2-D filter (where L = 9) in (b) by means of $\lceil L/2\rceil = 5$ (1-D) filters ($[I]_m$ denotes the mth row of input image I)

The centre for horizontal symmetry is the fourth column and it is defined as

$$ll(i,\,3-k) = ll(i,\,3+k) \tag{19}$$

which means that the x(3 − i)th and x(3 + i)th input samples entering (1-D) filter $P_0$ are multiplied by the same coefficients. Therefore the inner summations (for $P_0$) in (15)–(18) become

$$LL_{j\{0,8\}}(n) = ll(0,3)\cdot x_{\{0,8\}}(2n-3) + \sum_{i=0}^{2} ll(0,i)\cdot\big(x_{\{0,8\}}(2n-6+i) + x_{\{0,8\}}(2n-i)\big) \tag{20}$$

$$LH_{j\{0,8\}}(n) = lh(0,3)\cdot x_{\{0,8\}}(2n-3) + \sum_{i=0}^{2} lh(0,i)\cdot\big(x_{\{0,8\}}(2n-6+i) + x_{\{0,8\}}(2n-i)\big) \tag{21}$$

$$HL_{j\{0,8\}}(n) = hl(0,3)\cdot x_{\{0,8\}}(2n-3) + \sum_{i=0}^{2} hl(0,i)\cdot\big(x_{\{0,8\}}(2n-6+i) + x_{\{0,8\}}(2n-i)\big) \tag{22}$$

$$HH_{j\{0,8\}}(n) = hh(0,3)\cdot x_{\{0,8\}}(2n-3) + \sum_{i=0}^{2} hh(0,i)\cdot\big(x_{\{0,8\}}(2n-6+i) + x_{\{0,8\}}(2n-i)\big) \tag{23}$$

where $x_{\{0,8\}}(n)$ represents the nth element of the point-wise summation of the first and ninth rows, which is fed into $P_0$.
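The effect of the horizontal symmetry in (19) on a single 7-tap row filter can be sketched in Python as follows. This is an illustrative check, not the PE design itself: the coefficient values are random placeholders and the function names are hypothetical. It compares the seven-multiplication form with a folded form that adds the two samples sharing a coefficient before multiplying.

```python
import random

def direct(x, h, n):
    """y(n) = sum_{k=0}^{6} h[k] * x(2n - k): seven multiplications per output."""
    return sum(h[k] * x[2 * n - k] for k in range(7))

def folded(x, h, n):
    """Same output using h[3 - k] == h[3 + k]: four multiplications per output."""
    acc = h[3] * x[2 * n - 3]                            # centre tap
    for i in range(3):
        acc += h[i] * (x[2 * n - i] + x[2 * n - 6 + i])  # taps i and 6 - i share h[i]
    return acc

random.seed(2)
half = [random.uniform(-1, 1) for _ in range(4)]   # h[0..3], placeholder values
h = half + half[2::-1]                             # h[4..6] mirror h[2..0]
x = [random.uniform(-1, 1) for _ in range(32)]     # stands for the row sum fed into P0

errors = [abs(direct(x, h, n) - folded(x, h, n)) for n in range(3, 16)]
print("largest difference:", max(errors))
```

The folded form needs $\lceil M/2\rceil = 4$ multiplications instead of M = 7, at the cost of three extra additions per output sample.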

4.3 Arc2D-I: separable MRPA-based architecture

The proposed separable 2-D DBWT architecture is a modified version of the direct architecture described in [12] and is shown in Fig. 11. It consists of a delay line, a filter bank and a memory unit of J register blocks ($R_j$) that store intermediate outputs. The systolic filters, PE1, are based on the work presented in [6], as shown in Fig. 5. They exploit the decimation by 2 in the wavelet transform and the anti/symmetrical property of biorthogonal wavelet filters, and therefore reduce the number of multipliers by a factor of 4. The memory unit consists of J register blocks, each storing $N_j \times (L-1)$ words, where L is the filter length and $N_j$ is the input data size at decomposition level j. By organising the memory into blocks, the coefficients are automatically transposed into column-major format. The inputs to the filter bank are row-based and multiplexed between the output of the delay line and the output of the memory unit. By doing so, simple control and routing can be achieved without the need for $N^2$ memory units.

The computation of different levels of decomposition is scheduled according to row-based RPA scheduling [20]. An entire row of the input image (or of the intermediate LL, LH, HL and HH results) is fed into the filter bank at a time. This scheduling uses a buffer to store a single row of $LL_j$ coefficients for each decomposition level required. Since just a single row-delay line is processed, the controller and interconnect routing complexity are also reduced compared to other architectures [12].

Fig. 11 Arc2D-I: top-level architecture for separable 2-D DBWT

4.4 Arc2D-II: non-separable 2-D DBWT architecture

4.4.1 Top-level architecture: Equations (15)–(18) and (20)–(23) can be mapped into the proposed architecture, shown in Fig. 12 [30]. The design of the non-separable 2-D discrete biorthogonal wavelet filter architecture has been derived from the modified-recursive-pyramid-algorithm-based (MRPA-based) architecture in [23]. The MRPA-based architecture exploits the downsampling of output subbands and performs the first decomposition level interspersed with all other levels by means of only one processing unit. The top-level architecture for the (9 × 7) non-separable 2-D biorthogonal wavelet filter with three levels of decomposition (J = 3) is shown in Fig. 12.

The architecture is composed of a set of $\lceil L/2\rceil$ 1-D filter processors ($P_i$) and J sets of row-delay circuits, $R_j$ being used for the jth level of decomposition, where j = 0, 1, ..., J. Each row-delay circuit $R_j$ is composed of a pipe of (L − 1) row-delay elements with $N/2^j$ memory cells. $R_0$ stores the rows of the input image, whereas $R_{j>0}$ are used to store the rows of the $LL_j$ subband, which are used as input for computing the decomposition level j + 1.

The even-numbered and odd-numbered rows of the input image are fed simultaneously into processors $P_{2i}$ and $P_{2i+1}$ in a word-serial fashion by using two distinct row-delay pipes so that the decimated output can be directly computed. (This kind of parallel input I/O has also been exploited by other devices in the literature [12, 31–33].) Computation of the different levels of decomposition is scheduled according to an algorithm that differs from the MRPA, since the processors in this architecture require the parallel input of odd and even rows [23]. The kth row of each subband at the decomposition level j, $[LL^{(j)}]_{2k}$, can only be computed when two adjacent rows from the previous level ($[LL^{(j-1)}]_{2k}$ and $[LL^{(j-1)}]_{2k+1}$) are produced and stored in $[R_{j-1}]_0$ and $[R_{j-1}]_1$. The rows $LH_j$, $HL_j$ and $HH_j$ are immediately output, whereas the row $LL_j$ is fed back and stored either in $[R_j]_0$ if k is even or in $[R_j]_1$ if k is odd. The MRPA-like scheduling for $[LL_j]_k$ subband outputs is depicted in Fig. 13.

Fig. 12 Arc2D-II: non-separable (9 × 7)-tap 2-D DBWT top-level architecture
Even- and odd-numbered rows of the input 2-D frames are fed simultaneously with the use of split data pipes. In order to achieve this, even and odd rows can be stored in different internal memory blocks. Otherwise, the input data can be sampled in a zig-zag manner at a frequency which is twice the circuit operating frequency

Fig. 13 MRPA-like scheduling for LL outputs
(Timing diagram: time axis in clock cycles (ccs); tracks for the first, second and third decomposition levels.)
$[LL^j]_k$ denotes the kth row of the LL output at decomposition level j. The $[LL^j]_{2k}$ output is scheduled as soon as two rows ($[LL^{j-1}]_{2k}$ and $[LL^{j-1}]_{2k+1}$) from the previous level are computed. It takes N, N/2 and N/4 clock cycles for the computation of the first-, second- and third-level row outputs, respectively
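The per-row costs quoted for Fig. 13 (N, N/2 and N/4 clock cycles for the first three levels) suggest a simple estimate of the total cycle count. The Python sketch below assumes that level j produces $N/2^j$ output rows, that each of them takes $N/2^{j-1}$ clock cycles, and that no idle cycles or pipeline latency are added by the scheduler; these are simplifying assumptions of the sketch, not statements from the architecture description.

```python
from fractions import Fraction

def total_cycles(N, J):
    """Estimated clock cycles to compute J decomposition levels on one processing unit."""
    total = Fraction(0)
    for j in range(1, J + 1):
        rows = Fraction(N, 2 ** j)                   # output rows at level j (assumed)
        cycles_per_row = Fraction(N, 2 ** (j - 1))   # N, N/2, N/4, ...
        total += rows * cycles_per_row
    return total

N = 512
bound = Fraction(2 * N * N, 3)                       # limit of the geometric series
for J in (1, 2, 3):
    cycles = total_cycles(N, J)
    print(J, "levels:", cycles, "cycles =", float(cycles / (N * N)), "x N^2")
print("2N^2/3 =", float(bound))
```

Under these assumptions the total is $N^2\,(1/2 + 1/8 + 1/32 + \dots)$, a geometric series that stays below and approaches $2N^2/3$ as J grows.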

4.4.2 (1-D) filter processor: The number of (1-D) filter processors has been reduced from L to $\lceil L/2\rceil$ by exploiting the symmetry property along rows of biorthogonal filters, as explained in (13)–(18). In the case of a (9 × 7)-tap 2-D filter, because of the symmetry about the fifth row of the filter kernel, relevant input rows (i.e. $[I]_0$ and $[I]_8$, $[I]_1$ and $[I]_7$) are added and then fed into the associated processor (i.e. $P_0$ and $P_1$), as shown in Fig. 12.

Each processor $P_i$ in the architecture is composed of a set of processing elements ($PE_{i,j}$) and two coefficient adders, as shown in Fig. 14a. The number of PEs has been reduced from M = 7 to $\lceil M/2\rceil = 4$ by exploiting the symmetry properties of (9 × 7)-tap biorthogonal filters about the fourth column. The relevant samples on the input row are added and connected to the processing element (PE) before the actual multiplication is performed. Each PE consists of two multipliers (as shown in Fig. 14b). The multiplier $M1_{i,j}$ uses ll(i, j) and hl(i, j) in an interleaved fashion (controlled by signal Q) and computes either A(i, j) in even clock cycles or C(i, j) in odd clock cycles. Similarly, $M2_{i,j}$ computes B(i, j) and D(i, j) by using lh(i, j) and hh(i, j) in even and odd clock cycles, respectively. Once these terms are computed, they are fed into coefficient adders in the PEs ('coeff. adder_1' and 'coeff. adder_2' in Fig. 14a), which produce 1-D decimated filter outputs. Finally, these 1-D decimated filter outputs from each processor are fed into row adders at the top level (Fig. 12) in order to produce LL and LH in even clock cycles and HL and HH in odd clock cycles. The LL output is recursively passed through the same filter, according to the number of decomposition levels required. The functional timing diagram of $P_0$ for the first six outputs is shown in Fig. 15.

Since the input data are fed serially into the processors, the control system of the architecture needs an O(L) routing network, two J-output demultiplexers and J-input multiplexers. Therefore its control circuit is simpler than that of existing architectures based on parallel filters, which require an $O(L^2)$ routing network and $L^2$ multiplexers and demultiplexers [14].

Fig. 14 Block diagram of 1-D filter processor
a 1-D filter processor
b Processing element (PE) architecture used in non-separable 2-D DBWT
$LL_{\{0,8\}}$, $LH_{\{0,8\}}$, $HL_{\{0,8\}}$ and $HH_{\{0,8\}}$ are the inner sums given by (15)–(18) for $P_0$

Fig. 15 Functional analysis diagram for 1-D filter processor, $P_0$, where $y(n) = ([I]_{2m} + [I]_{2m-8})(2n)$ is fed to $P_0$
For the sake of illustration, only the functionality of multiplier $M1_{(0,i)}$ is depicted (an arrow indicates summation)

4.4.3 Symmetric extension: The periodic symmetric extension of input 2-D frame boundaries is defined in the JPEG-2000 standard [34], as shown in Fig. 16. The 2-D input frame is extended by reflecting $\lceil L/2\rceil$ rows above and below the frame, and $\lceil M/2\rceil$ columns to the left and right of the frame, where L and M are the filter dimensions. The periodic symmetric extension has been achieved with the incorporation of horizontal and vertical symmetric extension routers (HSER and VSER) along with delay lines, as shown in Fig. 17.

The vertical delay line and VSER are basically composed of a number of row-delay circuits for each level with delay elements. The VSER handles the routing of extension rows along even and odd input lines. The VSER introduces a delay of $N \cdot \lceil L/2\rceil$ clock cycles.

The extension of each input (1-D) row is managed by the horizontal delay line and the HSER in each processor. The first and last samples in each input row are delayed and then routed appropriately. The HSER introduces a delay of M/2 clock cycles.
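The boundary handling can be modelled in software with a short Python sketch. It assumes whole-sample symmetric extension, in which the signal is mirrored about its first and last samples without repeating them; this is the form commonly associated with odd-length filters in JPEG 2000, but whether the HSER and VSER reproduce exactly this sample ordering is an assumption here. The function names are hypothetical.

```python
def symmetric_extend(seq, ext):
    """Mirror `ext` items about the first and last items of `seq` (edge items not repeated)."""
    if ext >= len(seq):
        raise ValueError("extension must be shorter than the sequence")
    left = [seq[i] for i in range(ext, 0, -1)]    # seq[ext], ..., seq[1]
    right = [seq[-2 - i] for i in range(ext)]     # seq[-2], seq[-3], ...
    return left + list(seq) + right

def extend_frame(frame, ext_rows, ext_cols):
    """Extend every row horizontally, then mirror whole rows vertically."""
    rows = [symmetric_extend(r, ext_cols) for r in frame]
    return symmetric_extend(rows, ext_rows)

# 1-D example
row = [1, 2, 3, 4, 5]
print(symmetric_extend(row, 2))

# 2-D example with the extension sizes quoted in the text for a (9 x 7)-tap filter
N = 16
frame = [[r * N + c for c in range(N)] for r in range(N)]
ext = extend_frame(frame, ext_rows=5, ext_cols=4)   # ceil(9/2) rows, ceil(7/2) columns
print(len(ext), "x", len(ext[0]))
```

In hardware the same effect is obtained by re-routing delayed samples rather than by copying data, which is why the routers contribute latency rather than extra storage for a full extended frame.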

5 FPGA implementation

5.1 Development environment and implementation approach

Handel-C is a high-level language at the heart of a hardware compilation system known as Celoxica DK2 [35], which is designed to compile programs written in a C-like high-level language into synchronous hardware. One of the advantages of using hardware is the ability to exploit parallelism directly. Because standard C is a sequential language, Handel-C has additional constructs to support the parallelisation of code and to allow fine control over what hardware is generated. DK2 produces a netlist file, which is used during the place and route stage to generate the image or bitstream file (Fig. 18a). The RC1000-PP co-processor board used is a standard PCI bus card equipped with a large FPGA chip. It has 8 MB of SRAM directly connected to the FPGA in four 32-bit wide memory banks. All are accessible by the FPGA and any device on the PCI bus. Different methods of data transfer from the host PC or the environment to the FPGA are available as follows:

• bulk transfers of data between FPGA and PCI bus are performed through the memory banks 0 to 3;
• streams of bytes are most conveniently communicated through the unidirectional 8-bit control and status ports (Fig. 18b).

Fig. 16 Periodic symmetric extension scheme in 1-D where N = 16 and M = 7

The RC1000-PP board is supported with a macro library that simplifies the process of initialising and talking to the hardware. This library comprises a set of driver functions with the following functionality:

• initialisation and selection of a board;
• handling of FPGA configuration files;
• data transfer between the PC and the RC1000-PP board;
• functions to help with error checking and debugging.

These library functions can be included in a C or C++ program that runs on the host PC and performs data transfer via the PCI bus.

5.2 Performance evaluation

In order to verify the performance of the DBWT architectures, the biorthogonal wavelet filter designs have been ported to a Virtex-2000E FPGA chip (package: bg560, speed grade 6) [36] using Handel-C [35]. The designs of the architectures have been parameterised in terms of:

• filter length (L, L × M);
• input image size (N);
• number of decomposition levels (J);
• input and output data wordlength ($W_i$ and $W_o$);
• filter coefficients wordlength ($W_c$).

A 2's complement representation has been used for all input/output data in the designs. The core operation in each algorithm is multiply accumulation. The multiplication of two numbers comprising $W_i$ and $W_c$ bits results in an output with a wordlength of $W_i + W_c$ bits. To prevent overflows, the wordlength in the 'addition path' is defined as ($W_i + W_c + W_a$), where $W_a$ is the number of adders. Handel-C model simulations showed that for a 9-bit input data wordlength, the output wavelet coefficients from all levels are bounded by $2^{16}$. Therefore an output data wordlength of 16 bits is enough to represent wavelet coefficients without causing any overflow.
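The wordlength rule can be made concrete with a few lines of Python. This is a sketch of the arithmetic only: the helper names are hypothetical, and the value $W_a = 4$ is an arbitrary example rather than a figure taken from the designs.

```python
def product_bits(w_in, w_coeff):
    """Wordlength of the product of a w_in-bit and a w_coeff-bit 2's complement number."""
    return w_in + w_coeff

def adder_path_bits(w_in, w_coeff, n_adders):
    """Wordlength of the addition path as defined in the text: Wi + Wc + Wa."""
    return w_in + w_coeff + n_adders

def fits(value, bits):
    """True if `value` is representable as a `bits`-wide 2's complement number."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

W_I, W_C = 9, 9
most_negative = -(1 << (W_I - 1))            # -256 for a 9-bit operand
worst_product = most_negative * most_negative

print("product wordlength:", product_bits(W_I, W_C))
print("worst-case product:", worst_product)
print("fits in 17 bits:", fits(worst_product, 17))
print("fits in 18 bits:", fits(worst_product, 18))
print("addition path with 4 adders (example value):", adder_path_bits(W_I, W_C, 4))
```

Allowing one extra bit per adder is a conservative choice, since the sum of n equal-width terms needs only about $\lceil \log_2 n\rceil$ additional bits.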

In this paper, biorthogonal (9,7) and (9,3) 1-D wavelet filters up to three decomposition levels have been implemented on FPGA. To make a fair comparison with existing works, a filter coefficient and input data wordlength of 9 bits is used, whereas the output data are represented by 9 bits. In Arc1D-I, the wordlengths were selected as 11-bit input data and 9-bit coefficient values for the PE2 and 14-bit data and 9-bit coefficient values for the PE3, incorporating an 8-bit truncation between PEs. For the Arc1D-II implementation, the internal wordlength accuracy is 21 and 25 bits for PE1 and PE2, respectively, incorporating a 6-bit truncation between the two PEs. An internal quantisation of 9 bits is applied to the output of PE2 in order to avoid the growth of wordlength beyond 16 bits.

Fig. 17 Horizontal and vertical symmetric extension routers
a Horizontal periodic symmetric extension (HSER) unit
b Vertical periodic symmetric extension (VSER) unit

FPGA implementation performances for 1-D DBWTs are reported in Table 1 in terms of area and maximum clock frequency ($f_{max}$). Since each wavelet decomposition level requires a new 1-D filter block in Arc1D-I, the number of FPGA slices occupied increases and $f_{max}$ reduces slightly as the number of wavelet decomposition levels implemented increases. In the case of Arc1D-II, only N registers are required for each new decomposition level, since an RPA-based architecture is used. However, the complex control logic causes $f_{max}$ to reduce as the number of decomposition levels increases. Table 2 shows the comparison between the proposed 1-D DBWT architectures (Arc1D-I and Arc1D-II) and existing FPGA implementations. It can be seen that the proposed architectures compare favourably in terms of area and speed with the implementations proposed in [15, 16]. Although the bit-level implementations in [17] present a better area/speed ratio, their high computation time and low throughput rate make them unsuitable for high-speed/high-throughput applications. (This architecture performs just first-level decomposition.)

Fig. 18 Handel-C design flow and schematic view of FPGA
a Handel-C design flow targeting FPGAs
b Schematic view of the FPGA/banks part in the RC1000-PP board

Table 1: Performance results for FPGA implementations of the proposed 1-D DBWT architectures

| Design | Wavelet type | Decomposition levels | Area (slices) | fmax (MHz) |
|---|---|---|---|---|
| Arc1D-I | Bior(9,7) | 1 | 453 | 159 |
| Arc1D-I | Bior(9,7) | 3 | 2058 | 131 |
| Arc1D-I | Bior(9,3) | 1 | 340 | 165 |
| Arc1D-II | Bior(9,7) | 1 | 453 | 159 |
| Arc1D-II | Bior(9,7) | 3 | 1402 | 112 |
| Arc1D-II | Bior(9,3) | 3 | 1044 | 117 |

Table 2: Comparative evaluation of 1-D DBWT architectures for an L-tap biorthogonal filter: area and computation time

| Design | Latency (clock cycles) | Multipliers | Adders | FPGA area (slices) | FPGA fmax (MHz) |
|---|---|---|---|---|---|
| Pipelined [15] | $N_0$ | $J\cdot\lceil L/4\rceil$ | $J\cdot\lceil L/2\rceil$ | 785 | 85.49 |
| PA [16] | $JN_0$ | $\lceil L/4\rceil$ | $\lceil L/2\rceil$ | n/a | n/a |
| Bit-level [17] | $JN_0W_i$ | $\lceil L/2\rceil$ | $2\cdot L$ | 69 | 70.2 |
| Systolic [18]* | $N_0/2$ | $\lceil L/2\rceil$ | $2\cdot L$ | n/a | 50 |
| Arc1D-I | $N_0/2$ | $\sum_{j=1}^{J}\lceil L/2^j\rceil$ | $\approx\sum_{j=1}^{J}\lceil L/2^j\rceil$ | 453 | 159.058 |
| Arc1D-II | $N_0/2$ | $\lceil L/2\rceil + \lceil L/4\rceil$ | $\lceil 3L/2\rceil$ | 453 | 159.058 |

* This architecture performs just first-level decomposition.
$W_i = W_c = 9$ bits and J = 1 is considered for FPGA implementation results.

In addition to 1-D DBWTs, (9,7)-tap separable and non-separable 2-D biorthogonal filters up to three levels of decomposition have been implemented. The register block in the separable architecture and the row-delay circuits ($R_j$) in the non-separable architecture, as well as those in the vertical SERs, have been implemented by $W_i \times (N/2^k)$-bit FIFOs with the use of embedded block RAMs in the Virtex-E FPGA device [36]. Therefore an increase in hardware area in terms of FPGA slices has been prevented. Table 3 shows implementation results for the separable and non-separable architectures, in terms of FPGA area occupied and maximum operating frequency ($f_{max}$) for k = 1, 2 and 3.

Table 3: FPGA implementation results for 2-D DBWT architectures (Arc2D-I and Arc2D-II) on the Virtex-2000E chip for N × N = 512 × 512

| Levels | Arc2D-I (separable): Area (slices) | Arc2D-I: fmax (MHz) | Arc2D-I: Area/speed ratio | Arc2D-II (non-separable): Area (slices) | Arc2D-II: fmax (MHz) | Arc2D-II: Area/speed ratio |
|---|---|---|---|---|---|---|
| 1 | 1802 (10%) | 84 | 21.45 | 3974 (21%) | 112 | 35.48 |
| 2 | 2011 (12%) | 80 | 25.14 | 4126 (22%) | 110 | 37.50 |
| 3 | 2221 (13%) | 78 | 28.47 | 4348 (23%) | 105 | 41.40 |

Table 4: 2-D DBWT performance comparison with existing FPGA-based separable DBWT implementations

| Design | Wavelet | FPGA | FPGA area: Slices | FPGA area: BRAMs | fmax (MHz) | Latency (clock cycles) |
|---|---|---|---|---|---|---|
| Amphion [21] | 9/7 Lifted | Virtex E-8 | 3784 | 24 | 55 | $N^2$ |
| Cast (Line based) [38] | 9/7 Conv | Virtex 300E-8 | 2293 | 14 | 50 | $5.7\cdot N^2$ |
| Cast (Block based) [39] | 5/3 Lifted | Virtex 400E-8 | 971 | 10 | 51 | $3\cdot N^2$ |
| McCanny [20] | 9/7 Conv | Virtex-II | 2559 | 17 | 44.1 | $1.5\cdot N^2$ |
| Arc2D-I | 9/7 Sep | Virtex 1000E-6 | 2221 | 24 | 78 | $2\cdot N^2$ |
| Arc2D-II | 9/7 Non-Sep | Virtex 1000E-6 | 4348 | 24 | 105 | $2/3\cdot N^2$ |

The following analysis can be derived from Table 3:

• The separable architecture requires fewer multipliers compared to the non-separable architecture; therefore it occupies less FPGA area than the non-separable implementation does;
• The routing complexity of the non-separable architecture is less than that of the separable architecture; it achieves a better maximum operating frequency;
• The non-separable architecture requires fewer clock cycles to compute the DBWT ($2N^2/3$ against $2N^2$) and achieves a higher $f_{max}$; therefore it outperforms the separable architecture in terms of computation time.
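The computation-time comparison between the two 2-D architectures can be worked through numerically. The Python sketch below uses the three-level figures reported for N = 512 (slices and $f_{max}$) together with the latency expressions $2N^2$ and $2N^2/3$; it assumes each design runs continuously at its reported $f_{max}$, and the dictionary layout and function name are illustrative only.

```python
N = 512

# Three-level results reported for the two 2-D architectures
designs = {
    "Arc2D-I":  {"cycles": 2 * N * N,     "fmax_mhz": 78,  "slices": 2221},
    "Arc2D-II": {"cycles": 2 * N * N / 3, "fmax_mhz": 105, "slices": 4348},
}

def time_ms(cycles, fmax_mhz):
    """Computation time in milliseconds at the given clock frequency."""
    return cycles / (fmax_mhz * 1e6) * 1e3

times = {name: time_ms(d["cycles"], d["fmax_mhz"]) for name, d in designs.items()}
ratios = {name: d["slices"] / d["fmax_mhz"] for name, d in designs.items()}

for name in designs:
    print(f"{name}: {times[name]:.2f} ms per frame, area/speed ratio {ratios[name]:.2f}")
print(f"speed-up of Arc2D-II over Arc2D-I: {times['Arc2D-I'] / times['Arc2D-II']:.2f}x")
```

The sketch makes the trade-off explicit: the non-separable design occupies roughly twice the slices but needs about a quarter of the time per 512 × 512 frame under these assumptions.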

Since this study presents the first hardware implementation of non-separable 2-D wavelet transforms on FPGAs in the literature, a performance comparison can be provided only with existing separable 2-D DBWT-based FPGA implementations. The performance comparisons are shown in Table 4, based on the criteria of wavelet type, FPGA area, memory and latency. It can be seen from the table that $f_{max}$ for the proposed implementation is nearly twice that of the other implementations. This result has been achieved because of the simple routing and control circuit of the proposed architecture. Also, the proposed architecture has a latency of $(2/3)N^2$ clock cycles, which is at least 33% better than the others. Although it requires at least 15% more FPGA slices compared to other implementations, the proposed architecture can be run at half the frequency ($f_{max}/2$) and still achieve the same performance (in terms of computation time) as the other implementations. This approach reduces the power consumption by a factor of four. (Reducing the operating frequency by K leads to a reduction in the supply voltage and the power consumption by K and $K^2$, respectively [37].) Moreover, when the wavelet basis functions are not separable, only the proposed non-separable DBWT architecture and its implementation can be used.

6 Conclusions

Reconfigurable hardware devices offer an attractive combination of low cost and high performance together with an apparent flexibility. Although users must program FPGAs at a very low level and have a detailed knowledge of the architecture of the device being used, they remain very good target devices for rapid prototyping. This paper has described the development of a general framework for FPGA-based biorthogonal wavelet transform implementation. With the use of the proposed system, efficient, scalable and modular 1-D and 2-D DBWT architectures can be automatically generated and mapped on to FPGAs targeting real-time signal and image processing applications.

7 References

1 Daubechies, I.: ‘The wavelet transform, time-frequency localization and signal analysis’, IEEE Trans. Inf. Theory, 1990, 36, pp. 961–1005

2 Mallat, S.G.: ‘A theory for multiresolution signal decomposition: the wavelet representation’, IEEE Trans. Pattern Anal. Mach. Intell., 1989, 11, (7), pp. 674–693

3 Daubechies, I.: ‘Where do wavelets come from? – a personal point of view’. Proc. IEEE, 1996, 84, pp. 510–513

4 Graps, A.: ‘An introduction to wavelets’, IEEE Comput. Sci. Eng., 1995, 2, pp. 50–61

5 Cohen, A., Daubechies, I., and Feauveau, J.-C.: ‘Biorthogonal bases of compactly supported wavelets’, Commun. Pure Appl. Math., 1992, 45, (5), pp. 485–560

6 Masud, S., and McCanny, J.: ‘Finding a suitable wavelet for image compression applications’. Proc. IEEE Int. Conf. Acoust. Speech, Signal Process. (ICASSP ’98), 1998, vol. 5, pp. 2581–2584

7 Bradley, J.N., and Brislawn, C.M.: ‘The wavelet/scalar quantization compression standard for digital fingerprint images’. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS ’94), 1994, vol. 3, pp. 205–208

8 Denk, T., and Parhi, K.: ‘VLSI architectures for lattice structure based orthonormal discrete wavelet transforms’, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., 1997, 44, pp. 129–132

9 Vaidyanathan, P.: ‘Multirate systems and filterbanks’ (Prentice-Hall, 1993)

10 Lewis, A., and Knowles, G.: ‘VLSI architecture for 2-D Daubechies wavelet transform without multipliers’, Electron. Lett., 1991, 27, pp. 171–173

11 Parhi, K., and Nishitani, T.: ‘VLSI architectures for discrete wavelet transform’, IEEE Trans. VLSI Syst., 1993, 1, pp. 191–202

12 Vishwanath, M., Owens, R., and Irwin, M.: ‘VLSI architectures for the discrete wavelet transform’, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., 1995, 42, (5), pp. 305–316

13 Chakrabarti, C., Vishwanath, M., and Owens, R.: ‘A survey of architectures for the discrete and continuous wavelet transforms’, J. VLSI Signal Process. Syst., 1996, 3, (43), pp. 171–192

14 Weeks, M., and Bayoumi, M.: ‘Discrete wavelet transform: architectures, design and performance issues’, J. VLSI Signal Process. Syst., 2003, 35, (2), pp. 155–178

15 Masud, S., and McCanny, J.: ‘Reusable silicon IP cores for discrete wavelet transform applications’, IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., 2004, 51, (6), pp. 1114–1124

16 Masud, S.: ‘VLSI systems for discrete wavelet transforms’. PhD thesis, The Queen’s University of Belfast, United Kingdom, 1999

17 Nibouche, A.B.M., and Nibouche, O.: ‘Rapid prototyping of biorthogonal discrete wavelet transforms on FPGAS’. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS ’01), 2001, vol. 3, pp. 1399–1402

18 Jou, J., Shiau, Y., and Liu, C.-C.: ‘Efficient VLSI architectures for the biorthogonal wavelet transform by filter bank and lifting scheme’. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS ’01), 2001, vol. 2, pp. 529–532

19 Benkrid, A., Benkrid, K., and Crookes, D.: ‘A novel approach for diminishing and predicting the error dynamic range in finite wordlength FIR based architectures’. Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2003, vol. 2, pp. 581–584

20 McCanny, P., Masud, S., and McCanny, J.: ‘Design and implementation of the symmetrically extended 2-D wavelet transform’. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP ’02), 2002, vol. 3, pp. 3108–3111

21 CS 6210 discrete wavelet transform core datasheet, Amphion. http://www.amphion.co.uk

22 Chakrabarti, C., and Vishvanath, M.: ‘Efficient realizations of the discrete and continuous wavelet transforms: from single chip implementations to mappings on SIMD array computers’, IEEE Trans. Signal Process., 1995, 3, (43), pp. 759–771

23 Marino, F.: ‘Two fast architectures for the direct 2-D discrete wavelet transform’, IEEE Trans. Signal Process., 2001, 49, (6), pp. 1248–1258

24 Yu, C., and Chen, S.: ‘VLSI implementation of 2-D discrete wavelet transform for real-time video signal processing’, IEEE Trans. Consum. Electron., 1997, 43, (4), pp. 1270–1279

25 Tessier, R., and Burleson, W.: ‘Reconfigurable computing for digital signal processing: a survey’, J. VLSI Signal Process. Syst., 2001, 28, (1–2), pp. 7–27

26 Amira, A., Bouridane, A., and Milligan, P.: ‘RCMAT: a reconfigurable coprocessor for matrix algorithms’. Proc. Ninth ACM/IEEE Int. Symp. Field Program. Gate Arrays (FPGAs), 2001, p. 228

27 Amira, A.: ‘A custom coprocessor for matrix algorithms’. PhD thesis, The Queen’s University of Belfast, United Kingdom, 2000, http://www.cs.qub.ac.uk/~a.amira

28 Uzun, I.S., Amira, A., and Bouridane, A.: ‘FPGA implementations of fast Fourier transforms for real-time signal and image processing’, IEE Proc., Vis., Image Signal Process., 2005, 152, pp. 283–296

29 Uzun, I.S., Amira, A., and Bouridane, A.: ‘An efficient architecture for 1-D discrete biorthogonal wavelet transform’. Presented at IEEE Int. Symp. Circuits Syst. (ISCAS ’04), 2004

30 Uzun, I.S., and Amira, A.: ‘Design and FPGA implementation of non-separable 2-D biorthogonal wavelet transforms for image/video coding’. IEEE Int. Conf. Image Process. (ICIP ’04), 2004, pp. 2825–2828

31 Chuang, H., and Chen, L.: ‘VLSI architecture design for fast 2-D discrete orthonormal wavelet transform’, J. VLSI Signal Process., 1995, 10, pp. 225–236

32 Marino, F.: ‘Efficient high-speed/low-power pipelined architecture for the direct 2-D discrete wavelet transform’, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., 2000, 47, pp. 1476–1491

33 Dillen, G., Georis, B., Legat, J., and Cantineau, O.: ‘Combined line-based architecture for the 5-3 and 9-7 wavelet transform of JPEG 2000’, IEEE Trans. Circuits Syst. Video Technol., 2003, 13, pp. 944–950

34 Boliek, C.C.M., and Majani, E.: ‘JPEG 2000 part I final committee draft version 1.0’, March 2000

35 ‘Handel-C language reference manual’, Celoxica. www.celoxica.com

36 Virtex-E 1.8V FPGA complete data sheet, Xilinx, 2002. http://direct.xilinx.com/bvdocs/publications/ds022.pdf

37 Liu, D., and Svenson, C.: ‘Trading speed for low power by choice of supply and threshold voltages’, IEEE J. Solid State Circuits, 1993, 1, (28), pp. 10–17

38 LB_DFDWT line-based programmable forward DWT core datasheet, Cast Inc., http://www.cast-inc.com

39 BB_DFDWT block-based forward discrete wavelet transform core datasheet, Cast Inc., http://www.cast-inc.com/