Wiener Filter Hardware Realization

Wiener Filter Realization using Hardware: QR decomposition of matrices and inversion by Givens’ Rotation

7th Semester Project Report

Akashdip Das

Abantika Chowdhury

Sayan Chaudhuri

Guide: Dr. Ayan Banerjee

Electronics and Telecommunication Engineering Department

December, 2016

Contents

1 Abstract

2 Introduction

3 Wiener Filtering

4 Q-R decomposition of a matrix

5 Hardware for inversion of an upper triangular matrix (R)
5.1 Storage in a RAM
5.2 Address generation Mechanism
5.3 Hardware for finding the inverse of diagonal elements
5.4 Hardware for finding the inverse of the other elements

6 Conclusions
6.1 Multi PORT RAM for faster performance
6.2 Distributed Arithmetic for computing the product of the two matrices

7 Acknowledgements

1 Abstract

Super-resolution reconstruction is a method for reconstructing higher-resolution images from a set of low-resolution observations. The sub-pixel differences among different observations of the same scene make it possible to create higher-resolution images of better quality. In the last thirty years, many methods for creating high-resolution images have been proposed. However, hardware implementations of such methods are limited. Wiener filter design is one of the techniques we use initially for this process. Wiener filter design involves matrix inversion. A novel method for the matrix inversion is proposed in this report. The computational algorithm is QR decomposition using Givens rotation.

2 Introduction

The process of super-resolution initially requires that the image be restored from the effects of noise and degradation (assumed isotropic). For that purpose the Wiener filter is used, which forms an estimate of the image from the degraded one. The fundamentals of Wiener filtering are discussed in Section 3. Wiener filtering requires the inverse of a given matrix. The method followed here is QR decomposition (discussed in Section 4), which factors the matrix into an orthogonal matrix and an upper triangular matrix. Various techniques for decomposition of the matrix have been discussed in [3], [4]. However, the matrix inversion proposed there does not give a general solution to the problem; it is illustrated only for a specific 3x3 system. The inverse of an orthogonal matrix is obtained simply by computing its transpose. The inversion of the upper triangular matrix is discussed in this paper. The solutions available for this step are for a 3x3 or 4x4 system, so in this paper we generalize the inversion to an nxn system. The hardware required for this purpose is developed in Section 5, along with the reasoning behind it. The hardware has scope for enhanced performance, which is discussed in Section 6.

3 Wiener Filtering

In signal processing, the Wiener filter is a filter used to produce an estimate of a desired or target random process by linear time-invariant (LTI) filtering of an observed noisy process, assuming known stationary signal and noise spectra, and additive noise. The Wiener filter minimizes the mean square error between the estimated random process and the desired process. The goal of the Wiener filter is to compute a statistical estimate of an unknown signal using a related signal as an input and filtering that known signal to produce the estimate as an output. For example, the known signal might consist of an unknown signal of interest that has been corrupted by additive noise. The Wiener filter can be used to filter out the noise from the corrupted signal to provide an estimate of the underlying signal of interest. The Wiener filter is based on a statistical approach, the MMSE (Minimum Mean Square Error) criterion. The causal finite impulse response (FIR) Wiener filter, instead of using some given data matrix X and output vector Y, finds optimal tap weights by using the statistics of the input and output signals. It populates the input matrix X with estimates of the auto-correlation of the input signal (T) and populates the output vector Y with estimates of the cross-correlation between the output and input signals (V).
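As a minimal sketch of the statistics mentioned above, the following Python estimates the auto-correlation of an observed sequence and its cross-correlation with a desired sequence by time averaging. The function names, the 1/len normalisation and the lag convention (average of w[n−i]·s[n]) are assumptions made for illustration; the report does not specify an estimator.

```python
def autocorr(w, max_lag):
    """Estimate E{w[n] w[n+m]} for m = 0..max_lag by time averaging."""
    n = len(w)
    return [sum(w[k] * w[k + m] for k in range(n - m)) / n
            for m in range(max_lag + 1)]


def crosscorr(w, s, max_lag):
    """Estimate E{w[n-i] s[n]} for i = 0..max_lag by time averaging."""
    n = len(w)
    return [sum(w[k - i] * s[k] for k in range(i, n)) / n
            for i in range(max_lag + 1)]


# Hypothetical toy data: w is the observed sequence, s the desired one.
w = [1.0, 2.0, 3.0, 4.0]
s = [1.0, 1.0, 1.0, 1.0]
t_first_column = autocorr(w, 1)   # entries that populate the matrix T
v = crosscorr(w, s, 1)            # entries that populate the vector V
```

The first list supplies the entries of T and the second the entries of V for a filter of order 1.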

In order to derive the coefficients of the Wiener filter, consider the signal $w[n]$ being fed to a Wiener filter of order $N$ with coefficients $\{a_0, \cdots, a_N\}$. The output of the filter is denoted $x[n]$, which is given by the expression

\[ x[n] = \sum_{i=0}^{N} a_i\, w[n-i]. \]

The residual error is denoted $e[n]$ and is defined as $e[n] = x[n] - s[n]$ (see the corresponding block diagram). The Wiener filter is designed so as to minimize the mean square error (MMSE criterion), which can be stated concisely as follows:

\[ a_i = \arg\min E\left[e^2[n]\right], \]

where $E[\cdot]$ denotes the expectation operator. In the general case, the coefficients $a_i$ may be complex and may be derived for the case where $w[n]$ and $s[n]$ are complex as well. With a complex signal, the matrix to be solved is a Hermitian Toeplitz matrix, rather than a symmetric Toeplitz matrix. For simplicity, the following considers only the case where all these quantities are real. The mean square error (MSE) may be rewritten as:

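\[
\begin{aligned}
E\left[e^2[n]\right] &= E\left[(x[n]-s[n])^2\right] \\
&= E\left[x^2[n]\right] + E\left[s^2[n]\right] - 2E\left[x[n]s[n]\right] \\
&= E\left[\left(\sum_{i=0}^{N} a_i\, w[n-i]\right)^2\right] + E\left[s^2[n]\right] - 2E\left[\sum_{i=0}^{N} a_i\, w[n-i]\, s[n]\right]
\end{aligned}
\]

To find the vector $[a_0, \ldots, a_N]$ which minimizes the expression above, calculate its derivative with respect to each $a_i$: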

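\[
\begin{aligned}
\frac{\partial}{\partial a_i} E\left[e^2[n]\right] &= \frac{\partial}{\partial a_i}\left\{ E\left[\left(\sum_{j=0}^{N} a_j\, w[n-j]\right)^2\right] + E\left[s^2[n]\right] - 2E\left[\sum_{j=0}^{N} a_j\, w[n-j]\, s[n]\right] \right\} \\
&= 2E\left[\left(\sum_{j=0}^{N} a_j\, w[n-j]\right) w[n-i]\right] - 2E\left[s[n]\, w[n-i]\right] \\
&= 2\left(\sum_{j=0}^{N} E\left[w[n-j]\, w[n-i]\right] a_j\right) - 2E\left[w[n-i]\, s[n]\right]
\end{aligned}
\]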

Assuming that $w[n]$ and $s[n]$ are each stationary and jointly stationary, the sequences $R_w[m]$ and $R_{ws}[m]$, known respectively as the autocorrelation of $w[n]$ and the cross-correlation between $w[n]$ and $s[n]$, can be defined as follows:

\[ R_w[m] = E\{w[n]\, w[n+m]\}, \qquad R_{ws}[m] = E\{w[n]\, s[n+m]\} \]

The derivative of the MSE may therefore be rewritten as (notice that $R_{ws}[-i] = R_{sw}[i]$)

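\[ \frac{\partial}{\partial a_i} E\left[e^2[n]\right] = 2\left(\sum_{j=0}^{N} R_w[j-i]\, a_j\right) - 2R_{sw}[i], \qquad i = 0, \cdots, N. \]

Letting the derivative be equal to zero results in

\[ \sum_{j=0}^{N} R_w[j-i]\, a_j = R_{sw}[i], \qquad i = 0, \cdots, N, \]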

which can be rewritten in matrix form

\[
\underbrace{\begin{bmatrix}
R_w[0] & R_w[1] & \cdots & R_w[N] \\
R_w[1] & R_w[0] & \cdots & R_w[N-1] \\
\vdots & \vdots & \ddots & \vdots \\
R_w[N] & R_w[N-1] & \cdots & R_w[0]
\end{bmatrix}}_{T}
\underbrace{\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_N \end{bmatrix}}_{a}
=
\underbrace{\begin{bmatrix} R_{sw}[0] \\ R_{sw}[1] \\ \vdots \\ R_{sw}[N] \end{bmatrix}}_{v}
\]

These equations are known as the Wiener–Hopf equations. The matrix $T$ appearing in the equation is a symmetric Toeplitz matrix. Under suitable conditions on $R_w$, these matrices are known to be positive definite and therefore non-singular, yielding a unique solution to the determination of the Wiener filter coefficient vector,

\[ a = T^{-1} v. \]

It is this equation that makes it necessary to design matrix inversion hardware that is faster than the existing ones, so that there is less delay in image processing, and that generalizes to the NxN form. In this paper the matrix is inverted by QR decomposition using Givens rotation.
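The matrix form above can be illustrated with a small software reference. This is a sketch only: the helper names `toeplitz` and `solve` are hypothetical, the correlation values are made-up toy numbers, and the system is solved by ordinary Gaussian elimination rather than by the QR-based hardware proposed in this report.

```python
def toeplitz(r):
    """Symmetric Toeplitz matrix T with T[i][j] = r[|i - j|]."""
    n = len(r)
    return [[r[abs(i - j)] for j in range(n)] for i in range(n)]


def solve(T, v):
    """Solve T a = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(T)]   # augmented matrix [T | v]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        a[i] = (M[i][n] - sum(M[i][k] * a[k] for k in range(i + 1, n))) / M[i][i]
    return a


# Toy values (assumed): autocorrelation lags 0..1 and the right-hand side v.
r_w = [2.0, 1.0]
v = [1.0, 0.0]
a = solve(toeplitz(r_w), v)   # filter coefficients a_0, a_1
```

For these toy values T = [[2, 1], [1, 2]], so the coefficients are a_0 = 2/3 and a_1 = −1/3.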

4 Q-R decomposition of a matrix

QR decomposition is one of the most important operations in linear algebra. It can be used to invert a matrix, to solve a set of simultaneous equations, and in numerous applications in scientific computing. It is one of the relatively small number of matrix operation primitives from which a wide range of algorithms can be realized. QR decomposition is an elementary operation which decomposes a matrix into an orthogonal matrix and a triangular matrix. The QR decomposition of a real square matrix $A$ is a decomposition $A = QR$, where $Q$ is an orthogonal matrix ($Q^T Q = I$) and $R$ is an upper triangular matrix. More generally, an m x n matrix (with $m \ge n$) of full rank can be factored as the product of an m x n matrix with orthonormal columns ($Q^T Q = I$) and an n x n upper triangular matrix. There are different methods which can be used to compute the QR decomposition: the Gram-Schmidt orthonormalization method, Householder reflections, and Givens rotations. Each method has its own advantages and disadvantages because of its specific solution process. The Givens rotation technique is discussed below.

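If there are two nonzero vectors, $x$ and $y$, in a plane, the angle $\theta$ between them can be formalized as:

\[ \cos(\theta) = \frac{(x, y)}{\|x\|_2\, \|y\|_2} \]

where $(x, y)$ denotes the inner product. The rotation will be performed using a 16-bit pipelined CORDIC. This formula can be extended to n-dimensional vectors. The angle $\theta$ can be defined as

\[ \theta = \arccos \frac{(x, y)}{\|x\|_2\, \|y\|_2} \]

The inversion relies on the relations

\[ (A^{-1})^{-1} = A, \qquad A = QR, \qquad I = QQ^T, \]

where $R$ is an upper triangular matrix and $Q$ is an orthogonal matrix.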

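Consider a 4x4 system

\[
A = \begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\
a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\
a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\
a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4}
\end{bmatrix}
\]

whose upper triangular factor has the form

\[
R = \begin{bmatrix}
r_{1,1} & r_{1,2} & r_{1,3} & r_{1,4} \\
0 & r_{2,2} & r_{2,3} & r_{2,4} \\
0 & 0 & r_{3,3} & r_{3,4} \\
0 & 0 & 0 & r_{4,4}
\end{bmatrix}
\]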

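The matrix of a Givens rotation, shown here for a rotation acting on rows 2 and 3 of a 4x4 system, is

\[
G(i, j, \theta) = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & \cos(\theta) & \sin(\theta) & 0 \\
0 & -\sin(\theta) & \cos(\theta) & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]

The Givens rotation process applies a sequence of rotations, each of which nulls one sub-diagonal element of the matrix, until the upper triangular matrix R is formed. The Q matrix is obtained by concatenating all the Givens rotations. For a 3x3 system, R is found from three rotations, each of which nulls one sub-diagonal element with its own angle. The Givens rotation matrices needed for a 3x3 system are

\[
G_1 = \begin{bmatrix}
\cos(\theta_1) & 0 & \sin(\theta_1) \\
0 & 1 & 0 \\
-\sin(\theta_1) & 0 & \cos(\theta_1)
\end{bmatrix}
\qquad
G_2 = \begin{bmatrix}
\cos(\theta_2) & \sin(\theta_2) & 0 \\
-\sin(\theta_2) & \cos(\theta_2) & 0 \\
0 & 0 & 1
\end{bmatrix}
\qquad
G_3 = \begin{bmatrix}
1 & 0 & 0 \\
0 & \cos(\theta_3) & \sin(\theta_3) \\
0 & -\sin(\theta_3) & \cos(\theta_3)
\end{bmatrix}
\]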

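The cosines and sines of the rotation angles that null A(3,1), A(2,1) and A(3,2) are obtained from

\[
c_1 = \frac{A_1(1,1)}{\sqrt{A_1(1,1)^2 + A_1(3,1)^2}}, \qquad
s_1 = \frac{A_1(3,1)}{\sqrt{A_1(1,1)^2 + A_1(3,1)^2}}
\]

\[
c_2 = \frac{A_2(1,1)}{\sqrt{A_2(1,1)^2 + A_2(2,1)^2}}, \qquad
s_2 = \frac{A_2(2,1)}{\sqrt{A_2(1,1)^2 + A_2(2,1)^2}}
\]

\[
c_3 = \frac{A_3(2,2)}{\sqrt{A_3(2,2)^2 + A_3(3,2)^2}}, \qquad
s_3 = \frac{A_3(3,2)}{\sqrt{A_3(2,2)^2 + A_3(3,2)^2}}
\]

where $c_k$ and $s_k$ are the cosine and sine of the angle used in $G_k$, $A_1 = A$, and each pair is computed from the matrix on which that rotation acts:

\[ A_2 = G_1 A_1, \qquad A_3 = G_2 A_2, \qquad R = G_3 A_3, \qquad Q = G_1^T\, G_2^T\, G_3^T. \]

The inverse then follows from

\[ A = QR, \qquad A^{-1} = (QR)^{-1} = R^{-1} Q^{-1} = R^{-1} Q^T. \]

This necessitates the formation of the inverse of the upper triangular matrix and its subsequent multiplication by the transpose of the orthogonal matrix.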
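The rotation sequence can be sketched in software as follows. This is an illustrative floating-point model, not the CORDIC datapath of the report; the function name `givens_qr` and the test matrix are assumptions. For a 3x3 input the loop nulls the (3,1), (2,1) and (3,2) elements in that order, as in the text.

```python
import math


def givens_qr(A):
    """Return (Q, R) with A = Q R, nulling sub-diagonal entries by Givens rotations."""
    n = len(A)
    R = [row[:] for row in A]
    # Qt accumulates the product of all rotations, so that R = Qt * A.
    Qt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):
        for row in range(n - 1, col, -1):
            a, b = R[col][col], R[row][col]
            h = math.hypot(a, b)
            if h == 0.0:
                continue                     # element is already zero
            c, s = a / h, b / h
            for M in (R, Qt):                # rotate rows `col` and `row`
                for k in range(n):
                    top, bot = M[col][k], M[row][k]
                    M[col][k] = c * top + s * bot
                    M[row][k] = -s * top + c * bot
    Q = [[Qt[j][i] for j in range(n)] for i in range(n)]   # Q is the transpose of Qt
    return Q, R


A = [[4.0, 1.0, 2.0], [2.0, 3.0, 1.0], [1.0, 2.0, 5.0]]
Q, R = givens_qr(A)
```

Multiplying `Q` by `R` reproduces `A` up to rounding, and the entries of `R` below the diagonal are zero up to rounding.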

Figure 1: Basic hardware for matrix inversion using QR decomposition. The G matrix is formed using Givens rotation performed using CORDIC.
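As a rough illustration of how CORDIC can carry out a Givens rotation, the sketch below uses the standard shift-and-add iteration in vectoring mode to rotate a pair onto the x-axis and then replays the same micro-rotations on another pair. It is written in floating point with 16 iterations for readability; the 16-bit pipelined datapath of the report is not modelled, and the function names and the restriction to x > 0 are assumptions.

```python
import math

ITERATIONS = 16

# Gain compensation: each micro-rotation stretches the vector by sqrt(1 + 2^-2i).
K = 1.0
for i in range(ITERATIONS):
    K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_vectoring(x, y):
    """Rotate (x, y) onto the x-axis; return (magnitude, direction bits). Assumes x > 0."""
    dirs = []
    for i in range(ITERATIONS):
        d = -1 if y > 0 else 1                 # choose the direction that shrinks |y|
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        dirs.append(d)
    return x * K, dirs


def cordic_rotate(x, y, dirs):
    """Replay the recorded micro-rotations on another pair (x, y)."""
    for i, d in enumerate(dirs):
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x * K, y * K


# Null the second component of (3, 4), then rotate the pair (1, 2) by the same angle.
magnitude, dirs = cordic_vectoring(3.0, 4.0)
p, q = cordic_rotate(1.0, 2.0, dirs)
```

With c = 3/5 and s = 4/5, the exact Givens result for the second pair is (c + 2s, −s + 2c) = (2.2, 0.4); the CORDIC values agree to within the resolution of 16 iterations.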

5 Hardware for inversion of an upper triangular matrix (R)

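We have designed the hardware for inversion of a generalised N x N upper triangular matrix R, where

\[
R = \begin{bmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,n} \\
0 & r_{2,2} & \cdots & r_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & r_{n,n}
\end{bmatrix}
\]

Let B be $R^{-1}$. The algorithm is as follows: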

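    // diagonal elements: reciprocals of the diagonal of R
    for (row = 1; row <= n; row++)
        B(row, row) = 1 / R(row, row)
    next row
    // remaining elements; terms with k < row vanish because B is upper triangular
    for (row = 1; row <= n; row++)
        for (col = row + 1; col <= n; col++)
            s = 0
            for (k = row; k <= col - 1; k++)
                s = s + B(row, k) * R(k, col)
            next k
            // divide once, after the sum over k is complete
            B(row, col) = -s / R(col, col)
        next col
    next row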
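The algorithm can be sketched in Python as below. The function name and the 0-based indexing are choices made for this illustration; the division is applied once, after the inner sum over k is complete, and the inner sum starts at k = row because the earlier terms of an upper triangular B are zero.

```python
def invert_upper_triangular(R):
    """Invert an upper triangular matrix R (list of rows) with a non-zero diagonal."""
    n = len(R)
    B = [[0.0] * n for _ in range(n)]
    for row in range(n):                       # diagonal pass
        B[row][row] = 1.0 / R[row][row]
    for row in range(n):                       # off-diagonal pass, left to right
        for col in range(row + 1, n):
            s = sum(B[row][k] * R[k][col] for k in range(row, col))
            B[row][col] = -s / R[col][col]
    return B


R = [[2.0, 1.0, 3.0],
     [0.0, 4.0, 5.0],
     [0.0, 0.0, 8.0]]
B = invert_upper_triangular(R)
```

For this example the first row of `B` is 0.5, −0.125, −0.109375, and multiplying `B` by `R` gives the identity matrix.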

We observe that the inverse of an upper triangular matrix is also an upper triangular matrix, whose diagonal elements are the reciprocals of the diagonal elements of the original matrix. The other elements of the inverse are calculated recursively using the algorithm above. An example to illustrate how the algorithm works is shown below. Let A be an upper triangular matrix and B be its inverse; then

A=

a1,1 a1,2 a1.3 · · · r1,n0 a2,2 a2,3 · · · a2,n0 0 a3,3 · · · a3,n...

......

. . ....

0 0 0 · · · an,n

B=

b1,1 b1,2 ab1.3 · · · br1,n0 b2,2 b2,3 · · · b2,n0 0 b3,3 · · · b3,n...

......

. . ....

0 0 0 · · · bn,n

Since AB=I

9

a1,1 a1,2 a1.3 · · · r1,n0 a2,2 a2,3 · · · a2,n0 0 a3,3 · · · a3,n...

......

. . ....

0 0 0 · · · an,n

b1,1 b1,2 ab1.3 · · · br1,n0 b2,2 b2,3 · · · b2,n0 0 b3,3 · · · b3,n...

......

. . ....

0 0 0 · · · bn,n

=

1 0 0 · · · 00 1 0 · · · 00 0 1 · · · 0...

......

. . ....

0 0 0 · · · 1

Multiplying the $i$th row of matrix A by the $i$th column of B yields $a_{i,i} b_{i,i} = 1$. Hence we see that $b_{i,i} = 1 / a_{i,i}$.

Now we solve for the non-diagonal elements of the matrix B. We multiply the first row and second column first to get $a_{1,1} b_{1,2} + a_{1,2} b_{2,2} = 0$. We already know the value of $b_{2,2}$, so the only unknown is $b_{1,2}$. In general, to obtain the value of $b_{i,j}$ we multiply the $i$th row of A by the $j$th column of B and equate the result to 0, proceeding in a sequence of steps such that the values of b needed for the substitution have already been obtained. Since BA = I holds as well, the same elements can equally be obtained by multiplying the $i$th row of B by the $j$th column of A, which is the form used in the algorithm above.
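Written out for a general off-diagonal element, the step described above takes the following form. This is a worked restatement of the same argument; the summation limits follow from $a_{i,k} = 0$ for $k < i$ and $b_{k,j} = 0$ for $k > j$.

```latex
% Row i of A times column j of B, for i < j (from AB = I):
\sum_{k=i}^{j} a_{i,k}\, b_{k,j} = 0
\quad\Longrightarrow\quad
b_{i,j} = -\frac{1}{a_{i,i}} \sum_{k=i+1}^{j} a_{i,k}\, b_{k,j}

% Row i of B times column j of A, for i < j (from BA = I):
\sum_{k=i}^{j} b_{i,k}\, a_{k,j} = 0
\quad\Longrightarrow\quad
b_{i,j} = -\frac{1}{a_{j,j}} \sum_{k=i}^{j-1} b_{i,k}\, a_{k,j}

% Both give the same value for i = 1, j = 2:
b_{1,2} = -\frac{a_{1,2}}{a_{1,1}\, a_{2,2}}
```

The second form needs only elements of row $i$ of B that lie to the left of $b_{i,j}$, so each row can be filled from the diagonal outwards.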

5.1 Storage in a RAM

In any n x n matrix the total number of elements is $n \times n = n^2$. In the upper triangular matrix generated here the number of elements that can be non-zero is $n(n+1)/2$, since the remaining $n(n-1)/2$ elements in the bottom-left triangle are zero. So, to minimise the hardware, we have come up with an algorithm that omits storage of the zeros in the RAM. If the zeros were not omitted, the position of the element $r_{i,j}$ would be $j + (i-1) \times n$. Since the zeros are omitted, we need an algorithm to generate the RAM address for given i, j and n.

5.2 Address generation Mechanism

In the upper triangular matrix $r_{i,j} = 0$ for $i > j$, so there is no need to store these zeroes individually in the RAM. Instead we omit the zeroes and compute the position in the RAM corresponding to the inputs (i, j). In our mechanism, where zeroes are not stored, the address in the RAM for $r_{i,j}$ is

\[ n(i-1) + j - \frac{i(i-1)}{2} - 1. \]

This formula is obtained from the fact that with full storage the address of the element $r_{i,j}$ would be $j + (i-1) \times n$, but here row $k$ omits $k-1$ zeros, so the cumulative number of zeros omitted up to and including row $i$ is $\sum_{k=1}^{i} (k-1) = i(i-1)/2$. The final $-1$ makes the addresses start from 0.
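A small software check of the address formula is sketched below. The function name `packed_address` is hypothetical; indices are 1-based as in the text and the returned address is 0-based.

```python
def packed_address(i, j, n):
    """RAM address of r[i][j] (1 <= i <= j <= n) when zeros below the diagonal are not stored."""
    if not 1 <= i <= j <= n:
        raise ValueError("only elements on or above the diagonal are stored")
    # i * (i - 1) is always even, so the division by 2 is a one-bit right shift.
    return n * (i - 1) + j - (i * (i - 1)) // 2 - 1


n = 4
addresses = [packed_address(i, j, n)
             for i in range(1, n + 1)
             for j in range(i, n + 1)]
```

For n = 4 the ten stored elements map, row by row, to the consecutive addresses 0 to 9, so no RAM location is wasted.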

Figure 2: Block diagram of the address generation block

Figure 3: Circuit diagram of the address generation block

Hardware Required:
4 adders/subtractors
2 multipliers
1 bit right shifter

5.3 Hardware for finding the inverse of diagonal elements

The following circuit (Figure 4) can be used for inversion of the diagonal elements of the upper triangular matrix. The circuit consists of a loadable up counter that counts up to the number of rows in the matrix; a comparator indicates that the process needs to stop when the value n is reached. The counter value is sent to the address generator block of RAM A, and the same address is then sent to RAM B, so that the data is modified at the same location in both RAM A and RAM B.

Hardware Required:
1 Loadable Up Counter
1 Comparator
1 Inverter Block that computes the inverse of a 16 bit number

Time Required: n clock pulses
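The behaviour of this circuit can be modelled in software as below. This is a behavioural sketch under assumptions: the RAMs are plain Python lists holding the upper triangle row by row, the reciprocal is computed in floating point rather than by a 16-bit inverter block, and the names `packed_address` and `invert_diagonal` are hypothetical.

```python
def packed_address(i, j, n):
    """Address of element (i, j), 1-based, in the zero-free RAM layout."""
    return n * (i - 1) + j - (i * (i - 1)) // 2 - 1


def invert_diagonal(ram_a, n):
    """One reciprocal per count, written to the same address in RAM B."""
    ram_b = [0.0] * len(ram_a)
    count = 1                                  # loadable up counter
    while count <= n:                          # comparator against n
        addr = packed_address(count, count, n)
        ram_b[addr] = 1.0 / ram_a[addr]        # inverter block
        count += 1
    return ram_b


# 3x3 upper triangular matrix [[2, 1, 3], [0, 4, 5], [0, 0, 8]] without its zeros
ram_a = [2.0, 1.0, 3.0, 4.0, 5.0, 8.0]
ram_b = invert_diagonal(ram_a, 3)
```

The loop body runs exactly n times, which corresponds to the n clock pulses stated above if one diagonal element is processed per pulse.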

Figure 4: Schematic hardware design for inversion of diagonal elements

5.4 Hardware for finding the inverse of the other elements

The following circuit (Figure 5) can be used for computing all elements of the inverse other than the diagonal elements.

Hardware Required:
3 Loadable Up Counters
4 address generation blocks
1 divider
1 multiplier
4 adders/subtractors
1 Register
Necessary control circuits for termination of loops

No. of clock cycles needed: O(n2)
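To relate the loop structure of the inversion algorithm in Section 5 to an operation count, the sketch below counts two quantities: the off-diagonal elements that are computed, and the multiply-accumulate steps needed when the inner sum starts at k = row. Treating one multiply-accumulate as one unit of work on the single multiplier is an assumption of this sketch, not a statement from the report.

```python
def element_count(n):
    """Number of off-diagonal elements of the inverse that are computed."""
    return n * (n - 1) // 2


def mac_count(n):
    """Multiply-accumulate steps of the off-diagonal pass, with k running from row to col - 1."""
    count = 0
    for row in range(1, n + 1):
        for col in range(row + 1, n + 1):
            for k in range(row, col):
                count += 1
    return count


counts = [(n, element_count(n), mac_count(n)) for n in (2, 4, 8)]
```

The element count grows as n(n−1)/2, which matches the O(n2) figure above, while the multiply-accumulate count equals n(n²−1)/6. Whether the cycle count follows the first or the second depends on how the inner sum is scheduled, which the report does not detail.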

Figure 5: Schematic hardware design for inversion of elements other than those lying in the principal diagonal

6 Conclusions

6.1 Multi PORT RAM for faster performance

One of the obstacles in the way of obtaining high performance in computing is the memory wall. If the processing elements cannot get data from the register file (RF) at the processing rate, this causes a bottleneck that adversely affects the overall performance. To share data properly between the computational units, such a system needs a register file that can meet the requirements of the different computing units on the FPGA. The demand to process more data per unit time requires multiple read and write operations at a time, which can be achieved by using multi-port register files (MPo-RFs) instead of conventional single-port RFs (SPo-RFs). Multi-ported memories are challenging to implement on FPGAs since the block RAMs included in the fabric typically have only two ports. Hence memories requiring more than two ports must be constructed either out of logic elements or by combining multiple block RAMs. Some conventional multi-port register file implementations that can be used:

1. Distributed Memory
2. Replication
3. Banking
4. Multi-pumping
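Of the techniques listed, replication is the simplest to illustrate: each read port gets its own copy of the memory and every write is broadcast to all copies. The class below is a behavioural sketch only; the class name, depth and port count are assumptions chosen to mirror a 4-read, 1-write memory.

```python
class ReplicatedRam:
    """1-write, N-read memory modelled as N copies of a 1-write/1-read RAM."""

    def __init__(self, depth, read_ports):
        self.banks = [[0] * depth for _ in range(read_ports)]

    def write(self, addr, value):
        for bank in self.banks:            # the write is broadcast to every copy
            bank[addr] = value

    def read(self, port, addr):
        return self.banks[port][addr]      # each read port has a private copy


ram = ReplicatedRam(depth=16, read_ports=4)
ram.write(3, 42)
values = [ram.read(port, 3) for port in range(4)]
```

The cost of this scheme is storage: the memory is duplicated once per read port, and it does not by itself add write ports.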

Figure 6: 4 Read + 1 Write block RAM as an example of Multiport RAM

6.2 Distributed Arithmetic for computing the product of the two matrices

Distributed arithmetic is a technique developed for the real-time computation of the inner product of a vector with constant elements and a vector with varying elements. The inner product is computed without splitting it into separate multiplication and addition operations. Instead, partial inner products of the unchanging vector with successive bit-slices of the changing vector are shifted and summed. In the classical scheme all possible values of the partial inner products are calculated offline and stored in a look-up table (LUT). For the matrix product the content of the LUT is computed dynamically in online mode; it remains unchanged while the left matrix is multiplied by one column of the right matrix. Despite the need to calculate the contents of the LUT, the total number of addition micro-operations decreases in comparison with the classical way of calculating a matrix product.
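The bit-slice idea can be sketched as follows for an inner product with integer inputs in two's complement. The function names, the 8-bit word length and the example vectors are assumptions for illustration; the sketch shows classical distributed arithmetic with a table built once per coefficient vector.

```python
def build_lut(coeffs):
    """LUT[addr] = sum of coeffs[k] over every bit k that is set in addr."""
    size = 1 << len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(size)]


def da_inner_product(lut, xs, bits):
    """Inner product of the LUT's coefficient vector with xs (two's complement, `bits` wide)."""
    acc = 0
    for b in range(bits):
        addr = 0
        for k, x in enumerate(xs):             # bit-slice b of all inputs forms the address
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b                  # shift by the weight of this bit position
        acc += -term if b == bits - 1 else term   # the sign bit has weight -2^(bits-1)
    return acc


coeffs = [3, -2, 5]
lut = build_lut(coeffs)
y = da_inner_product(lut, [4, -7, 1], bits=8)
```

Here `y` equals 3·4 + (−2)·(−7) + 5·1 = 31, obtained with eight table look-ups, shifts and additions and no multiplier.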

7 Acknowledgements

The authors would like to thank their Project Guide, Dr. Ayan Banerjee, for his invaluable suggestions and proper direction throughout the course of the project. Heartfelt gratitude is also extended to Mr. Anirban Chakraborty, who is currently pursuing his Ph.D. under the guidance of Prof. Ayan Banerjee.

References

[1] Gonzalez, R. C., Woods, R. E. (2002). Digital Image Processing. Upper Saddle River, NJ: Prentice Hall.

[2] Seyid K., Blanc S., Leblebici Y., "Hardware Implementation of Real-Time Multiple Frame Super-Resolution", 2015 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC).

[3] Nafiz Ahmed Chisty, "Matrix Inversion Using QR Decomposition by Parabolic Synthesis".

[4] Brown, Robert Grover; Hwang, Patrick Y.C. (1996). Introduction to Random Signals and Applied Kalman Filtering (3 ed.). New York: John Wiley & Sons. ISBN 0-471-12839-2.

[5] D. Boulfelfel, R.M. Rangayyan, L.J. Hahn, and R. Kloiber, "Three-dimensional restoration of single photon emission computed tomography images", IEEE Transactions on Nuclear Science, 41(5): 1746-1754, October 1994.

[6] Wiener, Norbert (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. New York: Wiley. ISBN 0-262-73005-7.

[7] Thomas Kailath, Ali H. Sayed, and Babak Hassibi, Linear Estimation, Prentice-Hall, NJ, 2000, ISBN 978-0-13-022464-4.

[8] Wiener N., "The interpolation, extrapolation and smoothing of stationary time series", Report of the Services 19, Research Project DIC-6037, MIT, February 1942.

[9] Kolmogorov A.N., "Stationary sequences in Hilbert space" (in Russian), Bull. Moscow Univ. 1941, vol. 2, no. 6, 1-40. English translation in Kailath T. (ed.), Linear Least Squares Estimation, Dowden, Hutchinson & Ross, 1977.

[10] Vladislav Lesnikov, Tatiana Naumovich, Alexander Chastikov, "Modification of the architecture of a distributed arithmetic", East-West Design & Test Symposium (EWDTS), 2015 IEEE, pp. 1-4, 2015.

[11] Alvaro Lopes (Senior Software Engineer, Critical Software), "Tips & Tricks: Creating a 2W+4R FPGA Block RAM, Part 1".

[12] "An Efficient FPGA Implementation of Scalable Matrix Inversion Core using QR Decomposition".