
2009 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing

ON HYBRID EXACT-APPROXIMATE JOINT DIAGONALIZATION

Arie Yeredor

School of Electrical Engineering, Tel-Aviv University, P.O. Box 39040, Tel-Aviv 69978, Israel

[email protected]

ABSTRACT

We consider a particular form of the classical approximate joint diagonalization problem, often encountered in Maximum Likelihood source separation based on second-order statistics with Gaussian sources. In this form the number of target-matrices equals their dimension, and the joint diagonality criterion requires that in each transformed ("diagonalized") target-matrix, all off-diagonal elements on one specific row and column be exactly zero, but does not care about the other (diagonal or off-diagonal) elements. We show that this problem always has a solution for symmetric, positive-definite target-matrices and present some interesting alternative formulations. We review two existing iterative approaches for obtaining the diagonalizing matrices and propose a third one with faster convergence.

1. INTRODUCTION

The general framework of Approximate Joint Diagonalization (AJD), which is closely related to the problem of Blind Source Separation (BSS), considers a set of K (typically more than two) square, symmetric, real-valued N × N matrices denoted Q_1, ..., Q_K, often termed the "target matrices". The goal is to find a single matrix B (or its inverse A, see below) which best "jointly diagonalizes" the target matrices in some sense. In the BSS context, the diagonalizing B serves as an estimate of the demixing matrix, which is subsequently used for recovering the sources from their observed mixtures. A serves, in turn, as an estimate of the unknown mixing matrix.

Quite a few approaches to the AJD problem have been suggested in the past two decades, mainly differing in the proposed criteria for measuring the extent of attained joint diagonalization. These can be generally divided into "direct" criteria, looking for B which makes all B Q_k B^T "as diagonal as possible"; and "indirect" criteria, looking for A (and K diagonal matrices D_k) such that all Q_k are "best fitted" with A D_k A^T. When B (or A) is restricted (such as in [1]) to be orthonormal, the problem is commonly referred to as orthogonal AJD; otherwise it is non-orthogonal AJD.


In the context of Maximum Likelihood (ML) or Quasi-ML (QML) [3] source separation based on Second Order Statistics (SOS) for Gaussian sources, the likelihood equations (often also called "estimating equations" in this context) take an interesting form, which can be interpreted as a very particular case of (non-orthogonal) AJD. The equations can be viewed as requiring the optimization of a hybrid exact-approximate AJD criterion, satisfied both in the "direct" and in the "indirect" formulation. We shall refer to the problem of solving these equations as the "Hybrid Exact-Approximate Diagonalization" (HEAD) problem.

General AJD is basically an ad-hoc tool which attempts to "best fit" a prescribed model to the set of target matrices. HEAD, however, enjoys (with some particular choices of target-matrices) the property of asymptotic optimality of the resulting estimate (of either the demixing or mixing matrix) in the relevant BSS context. Indeed, as shown in [5, 4], general AJD can be made asymptotically optimal as well, by use of proper weighting. However, the same asymptotic optimality appears in a much more "natural" and computationally simpler way in HEAD, since, as mentioned, HEAD can directly attain the ML estimate of A or B in such cases.

In Section 2 we formulate and characterize the HEAD problem and obtain conditions for existence of a solution. In Section 3 we briefly review two existing solution approaches and propose a third one, with demonstrated improved convergence and computational advantages.

2. HYBRID EXACT-APPROXIMATE DIAGONALIZATION (HEAD)

While in classical AJD the number K of target-matrices is arbitrary (the only restriction is usually K > 2), one of the main features of HEAD is that K must equal the matrices' dimension, namely K = N. Indeed, consider N target-matrices Q_1, ..., Q_N, each of dimensions N × N. The HEAD problem can be stated as:

P1: Given N target-matrices Q_1, ..., Q_N, find an N × N matrix B = [b_1 b_2 ... b_N]^T, such that

$$\mathbf{b}_m^T\mathbf{Q}_n\mathbf{b}_n = \delta_{mn}, \qquad \forall\, m,n \in \{1,\dots,N\}, \tag{1}$$

where δ_mn denotes Kronecker's delta (which is 1 if m = n and 0 otherwise). Equivalently, the same problem can be stated as:

P2: Given N target-matrices Q_1, ..., Q_N, find an N × N matrix B, such that

$$\mathbf{B}\,\mathbf{Q}_n\mathbf{B}^T\mathbf{e}_n = \mathbf{e}_n, \qquad n = 1,\dots,N, \tag{2}$$

where e_n denotes the n-th column of the N × N identity matrix I.

In other words, each transformed matrix B Q_n B^T should be exactly "diagonal" in its n-th column (and, since it is symmetric, also in its n-th row), in the sense that all off-diagonal elements in this row and column must be exactly zero. All other elements may take arbitrary (nonzero) values. In addition, with the problem formulations above we also require that the diagonal (n,n)-th element of B Q_n B^T be 1 - but this is merely a scaling constraint on the rows of B: once any matrix B satisfying the exact off-diagonality criteria is found, it is straightforward to rescale each of its rows such that b_n^T Q_n b_n = 1, without any effect on the off-diagonality property.

Multiplying both sides of (2) by A = B^{-1} on the left we obtain

$$\mathbf{Q}_n\mathbf{B}^T\mathbf{e}_n = \mathbf{A}\mathbf{e}_n \;\Longrightarrow\; \mathbf{a}_n = \mathbf{Q}_n\mathbf{b}_n, \qquad n = 1,\dots,N, \tag{3}$$

where a_n denotes the n-th column of A = [a_1 ... a_N]. In other words, the same problem can be stated as follows:

P3: Given N target-matrices Q_1, ..., Q_N, find two reciprocal N × N matrices B and A = B^{-1}, such that the n-th column of A is given by Q_n b_n, with b_n^T being the n-th row of B, n = 1, ..., N.

Assuming that all target-matrices are invertible, we may now swap the roles of B and A, obtaining that the n-th column b_n of B^T should be given by Q_n^{-1} a_n, where a_n denotes the n-th row of A^T, n = 1, ..., N. This means that the same problem formulations P1 and P2 above may be cast in terms of A^T (instead of B), with the inverses of the target-matrices substituting the target-matrices. This implies that the "direct" and "indirect" formulations of HEAD coincide: if B is the HEAD diagonalizer of Q_1, ..., Q_N, then its transposed inverse A^T is the HEAD diagonalizer of the inverse set Q_1^{-1}, ..., Q_N^{-1}. It is important to note that this desirable "self-reciprocity" property is generally not shared by other non-orthogonal AJD algorithms. In fact, it is easy to show that this property is satisfied in non-orthogonal AJD when (and only when) the target matrices are exactly jointly diagonalizable. In general, however, the target matrices are not exactly jointly diagonalizable, since they are merely estimates of a set of such matrices. Nevertheless, the HEAD solution always enjoys the "self-reciprocity", reflecting some kind of "self-consistency".
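As a concrete illustration (ours, not part of the original paper), the following minimal NumPy sketch evaluates the residuals of a candidate B, both in the element-wise form (1) and in the equivalent P3 form a_n = Q_n b_n; the target matrices are assumed to be supplied as a length-N sequence Q of symmetric N × N arrays, and the function name is our own.

```python
import numpy as np

def head_residuals(B, Q):
    """Residuals of the HEAD equations for a candidate diagonalizer B.

    R1[m, n] = b_m^T Q_n b_n - delta_mn   (element-wise form (1));
    R3[:, n] = Q_n b_n - a_n              (equivalent P3 form, A = B^{-1}).
    """
    N = B.shape[0]
    A = np.linalg.inv(B)
    R1 = np.empty((N, N))
    R3 = np.empty((N, N))
    for n in range(N):
        Qb = Q[n] @ B[n]            # Q_n b_n  (B[n] is the row b_n^T)
        R1[:, n] = B @ Qb           # m-th entry: b_m^T Q_n b_n
        R3[:, n] = Qb - A[:, n]     # vanishes at a HEAD solution
    R1 -= np.eye(N)
    return R1, R3
```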

As is obvious, e.g., from P1, HEAD requires the solution of N² equations in N² unknowns (the elements of B). Since these equations are nonlinear, a (real-valued) solution may or may not exist in general, and may or may not be unique. We shall show, however, that if all the N target-matrices are positive-definite (PD), a solution must exist (but we do not have an explicit condition for uniqueness).

Indeed, let Q_1, ..., Q_N denote a set of (symmetric) PD target matrices, and let λ_n denote the smallest eigenvalue of Q_n, n = 1, ..., N. Consider the function

$$C(\mathbf{B}) = \log\left|\det\mathbf{B}\right| - \frac{1}{2}\sum_{n=1}^{N}\mathbf{e}_n^T\mathbf{B}\mathbf{Q}_n\mathbf{B}^T\mathbf{e}_n. \tag{4}$$

For all nonsingular B, C(B) is continuous, differentiable and bounded from above:

$$
\begin{aligned}
C(\mathbf{B}) &= \log\left|\det\mathbf{B}\right| - \frac{1}{2}\sum_{n=1}^{N}\mathbf{b}_n^T\mathbf{Q}_n\mathbf{b}_n \\
&\le \log\prod_{n=1}^{N}\|\mathbf{b}_n\| - \frac{1}{2}\sum_{n=1}^{N}\lambda_n\mathbf{b}_n^T\mathbf{b}_n \\
&= \frac{1}{2}\sum_{n=1}^{N}\left\{\log\|\mathbf{b}_n\|^{2} - \lambda_n\|\mathbf{b}_n\|^{2}\right\} \\
&\le \frac{1}{2}\sum_{n=1}^{N}\left\{-\log\lambda_n - 1\right\},
\end{aligned} \tag{5}
$$

where ‖b_n‖² ≜ b_n^T b_n denotes the squared norm of b_n, and where we have used the properties:

1. |det B| ≤ ∏_{n=1}^{N} ‖b_n‖ (Hadamard's inequality);

2. b_n^T Q_n b_n ≥ λ_n ‖b_n‖²; and

3. log x - λx ≤ -log λ - 1 for all x > 0.

Consequently, noting also that C(αB) → -∞ (for any fixed B) as α → 0 or as α → ∞, C(B) attains a maximum for some nonsingular B, and its derivative with respect to B at the maximum point vanishes. Indeed, differentiating C(B) with respect to B_(n,m) (the (n,m)-th element of B) and equating to zero we get (for all m, n ∈ {1, ..., N})

$$
\begin{aligned}
\frac{\partial C(\mathbf{B})}{\partial B_{(n,m)}}
&= A_{(m,n)} - \frac{1}{2}\sum_{k=1}^{N} 2\,\mathbf{e}_k^T\mathbf{E}_{nm}\mathbf{Q}_k\mathbf{B}^T\mathbf{e}_k \\
&= A_{(m,n)} - \sum_{k=1}^{N}\delta_{kn}\,\mathbf{e}_m^T\mathbf{Q}_k\mathbf{B}^T\mathbf{e}_k \\
&= A_{(m,n)} - \mathbf{e}_m^T\mathbf{Q}_n\mathbf{b}_n = 0,
\end{aligned} \tag{6}
$$

where E_nm ≜ e_n e_m^T denotes an all-zeros matrix with only a 1 at the (n,m)-th location. By concatenating these equations for m = 1, ..., N into a vector we get a_n = Q_n b_n, which has to be satisfied for each n = 1, ..., N - as required in formulation P3 above. This means that the solution of HEAD can be expressed as the maximizer of C(B), which, as mentioned above, always exists when the target matrices are all PD.

Naturally, this derivation is closely related to the fact that HEAD yields the ML (or QML) estimate of the demixing matrix in some specific BSS contexts (e.g., [3, 2]) with some specific target-matrices. However, we obtained here a more general result, which holds for any set of PD target matrices, and not only for the specific matrices used for ML or QML estimation in [3, 2].
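For later reference, here is a similar minimal sketch (ours, under the same assumptions on Q) of the cost (4) and its gradient (6); since the gradient vanishes exactly at a HEAD solution, it can also serve as a convergence check for the algorithms of the next section.

```python
import numpy as np

def head_cost_and_gradient(B, Q):
    """Cost C(B) of (4) and its gradient (6); the gradient vanishes at a HEAD solution."""
    N = B.shape[0]
    A = np.linalg.inv(B)
    C = np.linalg.slogdet(B)[1]            # log|det B|
    G = np.empty((N, N))                   # G[n, m] = dC/dB(n, m)
    for n in range(N):
        Qb = Q[n] @ B[n]                   # Q_n b_n
        C -= 0.5 * B[n] @ Qb
        G[n, :] = A[:, n] - Qb             # A(m, n) - e_m^T Q_n b_n
    return C, G
```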

3. SOLUTION OF HEAD

Unlike classical AJD, the HEAD problem and its solutions have rarely been addressed in the literature. To the best of our knowledge, so far only two different iterative algorithms have been proposed (both in the context of ML or QML BSS): one by Pham and Garat [3], which is based on multiplicative updates of B, and the other by Degerine and Zaidi [2], which is based on alternating oblique projections with respect to the columns of B.

3.1. Multiplicative Updates (MU)

The basic idea in this approach is to iteratively update B using multiplicative updates of the form

$$\mathbf{B} \leftarrow (\mathbf{I} - \mathbf{P}^T)\,\mathbf{B}, \tag{7}$$

where P is a "small" update matrix. Let us denote

$$\tilde{\mathbf{Q}}_n \triangleq \mathbf{B}\mathbf{Q}_n\mathbf{B}^T, \qquad n = 1,\dots,N. \tag{8}$$

The implied update of the HEAD equations (1) in terms of P can now be expressed as

$$\mathbf{e}_n^T(\mathbf{I} - \mathbf{P}^T)\,\tilde{\mathbf{Q}}_n\,(\mathbf{I} - \mathbf{P})\,\mathbf{e}_m = \delta_{mn}, \tag{9}$$

which, when neglecting terms which are quadratic in the elements of P and ignoring the scaling equations (those with m = n), can also be written as

$$\mathbf{e}_n^T\tilde{\mathbf{Q}}_n\mathbf{p}_m + \mathbf{e}_m^T\tilde{\mathbf{Q}}_n\mathbf{p}_n = \tilde{Q}_n(n,m), \qquad n \neq m, \tag{10}$$

where p_i denotes the i-th column of P, and where Q̃_n(n,m) denotes the (n,m)-th element of Q̃_n. This yields N(N-1) linear equations for the N(N-1) off-diagonal elements of P, whose diagonal terms are arbitrarily set to zero.

Further simplification may be obtained if we assume that all Q̃_n are already "nearly diagonal", so that the off-diagonal elements in each row are negligible with respect to the diagonal element in that row. Under this assumption, the linear equations (10) can be further approximated as

$$\tilde{Q}_n(n,n)\,P(n,m) + \tilde{Q}_n(m,m)\,P(m,n) = \tilde{Q}_n(n,m), \qquad n \neq m. \tag{11}$$

Rewriting this equation with the roles of m and n exchanged, we obtain two equations for the two unknowns P(n,m) and P(m,n), thus decoupling the original set of dimensions N(N-1) × N(N-1) into N(N-1)/2 sets of dimensions 2 × 2 each. The procedure is repeated iteratively until convergence.
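For illustration only (this is our sketch, not the implementation of [3]), one MU sweep based on the decoupled 2 × 2 systems of (11) might look as follows; the final row rescaling enforces the scaling (m = n) equations of (1), and Q is again assumed to be a length-N sequence of symmetric PD arrays.

```python
import numpy as np

def mu_sweep(B, Q):
    """One multiplicative-update sweep based on the approximated equations (11)."""
    N = B.shape[0]
    Qt = np.array([B @ Qn @ B.T for Qn in Q])   # transformed targets (8)
    P = np.zeros((N, N))                        # diagonal of P stays zero
    for n in range(N):
        for m in range(n + 1, N):
            # 2x2 system in P(n,m), P(m,n): (11) and (11) with m and n exchanged
            M = np.array([[Qt[n, n, n], Qt[n, m, m]],
                          [Qt[m, n, n], Qt[m, m, m]]])
            rhs = np.array([Qt[n, n, m], Qt[m, m, n]])
            P[n, m], P[m, n] = np.linalg.solve(M, rhs)
    B = (np.eye(N) - P.T) @ B                   # multiplicative update (7)
    for n in range(N):                          # rescale rows: b_n^T Q_n b_n = 1
        B[n] /= np.sqrt(B[n] @ Q[n] @ B[n])
    return B
```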

3.2. Iterative Relaxation (IR) - Alternating Directions

Consider a specific value of n, and define a scalar product ⟨u, v⟩_{Q_n} ≜ u^T Q_n v. Evidently, (1) implies that (with respect to this scalar product) b_n has unit norm and is orthogonal to all other rows of B. Thus, assuming all other rows of B are fixed, b_n is updated by (oblique) projection onto the subspace orthogonal to all other rows, followed by scale normalization, so that the updated b_n satisfies

$$\tilde{\mathbf{B}}^{(n)}\mathbf{Q}_n\mathbf{b}_n = \mathbf{0}, \qquad \mathbf{b}_n^T\mathbf{Q}_n\mathbf{b}_n = 1, \tag{12}$$

where B̃^(n) denotes the matrix B without its n-th row (an (N-1) × N matrix). This procedure (applied sequentially for n = 1, ..., N) is repeated iteratively until convergence. Note that due to the "self-reciprocity" property of HEAD, an equivalent solution can be obtained by applying the same procedure to the columns of A, with the matrices Q_n replaced by their inverses.
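A minimal sketch of one IR sweep (ours, not the implementation of [2]): for each n, a vector v spanning the null space of B̃^(n) is taken from an SVD, b_n is set proportional to Q_n^{-1} v (so that Q_n b_n is orthogonal to all other rows), and the result is scaled so that b_n^T Q_n b_n = 1.

```python
import numpy as np

def ir_sweep(B, Q):
    """One Iterative-Relaxation sweep: update each row of B in turn, per (12)."""
    N = B.shape[0]
    B = B.copy()
    for n in range(N):
        B_minus = np.delete(B, n, axis=0)        # B without its n-th row
        v = np.linalg.svd(B_minus)[2][-1]        # spans the null space of B_minus
        bn = np.linalg.solve(Q[n], v)            # b_n proportional to Q_n^{-1} v
        B[n] = bn / np.sqrt(v @ bn)              # enforce b_n^T Q_n b_n = 1
    return B
```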

3.3. The proposed approach - Newton's method with conjugate gradient (NCG)

We propose to apply Newton's method to the nonlinear equations (6). To this end, let us define the N² × N² Hessian matrix H as follows. First, we define the indexing function ix(m,n) ≜ (m-1)N + n, which determines the location of B_(m,n) in vec(B^T) (the concatenation of the columns of B^T into an N² × 1 vector). Then, differentiating (6) with respect to B_(p,q) we get (for all m, n, p, q ∈ {1, ..., N})

$$
\begin{aligned}
H\big(\mathrm{ix}(m,n),\,\mathrm{ix}(p,q)\big)
&= \frac{\partial^2 C(\mathbf{B})}{\partial B_{(m,n)}\,\partial B_{(p,q)}}
= \frac{\partial}{\partial B_{(p,q)}}\left\{A_{(n,m)} - \mathbf{e}_n^T\mathbf{Q}_m\mathbf{b}_m\right\} \\
&= -\mathbf{e}_n^T\mathbf{A}\mathbf{E}_{pq}\mathbf{A}\mathbf{e}_m - \mathbf{e}_n^T\mathbf{Q}_m\mathbf{e}_q\cdot\delta_{mp} \\
&= -A_{(n,p)}A_{(q,m)} - Q_m(n,q)\cdot\delta_{mp},
\end{aligned} \tag{13}
$$

(where we have used ∂A = -A · ∂B · A). The key observation here is that if we differentiate at B = I, then H becomes considerably sparse, since at B = I we have A_(n,p)A_(q,m) = δ_np δ_qm. The solution of the associated system of N² × N² equations (for the update of all N² elements of B) can then be attained with relative computational simplicity using the conjugate gradient method (which exploits this sparsity). Note, indeed, that with B = I we have

$$\mathbf{H} = -\big(\mathrm{Bdiag}(\mathbf{Q}_1,\dots,\mathbf{Q}_N) + \mathbf{P}\big), \tag{14}$$

where the Bdiag(·) operator creates a block-diagonal matrix from its matrix arguments, and where P is merely a permutation matrix transforming the vec(·) of a matrix into the vec(·) of its transpose, namely for any N × N matrix X we have P vec(X) = vec(X^T) (note also that P = P^T = P^{-1}). Therefore, the operation of H on X can be easily expressed as:

$$\mathbf{H}\,\mathrm{vec}(\mathbf{X}) = -\begin{bmatrix}\mathbf{Q}_1\mathbf{x}_1\\ \vdots\\ \mathbf{Q}_N\mathbf{x}_N\end{bmatrix} - \mathrm{vec}(\mathbf{X}^T), \tag{15}$$

where x_1, ..., x_N denote the columns of X.
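Since H never has to be formed explicitly, the conjugate-gradient solver only needs the matrix-vector product (15). A minimal NumPy sketch of this product (ours; the function name is our own, and the argument Qt holds whichever matrices currently play the role of Q_1, ..., Q_N):

```python
import numpy as np

def hessian_matvec(Qt, x):
    """Action (15) of the Hessian (14) at B = I on a vector x = vec(X)."""
    N = Qt.shape[0]
    X = x.reshape(N, N, order='F')      # recover X from its column-stacking
    HX = np.empty_like(X)
    for m in range(N):
        HX[:, m] = -Qt[m] @ X[:, m]     # block-diagonal part: -Q_m x_m
    HX -= X.T                           # permutation part: -vec(X^T)
    return HX.reshape(-1, order='F')
```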

Luckily, the structure of the problem enables us to always work in the vicinity of B = I, as each update of B can be translated into a transformation of the target matrices, defining a "new" problem in terms of the transformed matrices. Then, B = I can again be conveniently used as an "initial guess" for the new problem. Summarizing our algorithm, given the target matrices and some initial guess of B, we repeat the following until convergence:

1. Update the transformed target matrices Q̃_n = B Q_n B^T, n = 1, ..., N (as in (8)).

2. Construct the deviation ε of the updated gradient equations (6) at B = I from zero, i.e., ε(ix(n,m)) = δ_mn - e_m^T Q̃_n e_n.

3. Find the correction matrix Δ by solving H vec(Δ^T) = -ε. Use the conjugate gradient method¹, exploiting the sparsity of H by the use of (15) (with each Q_n substituted by Q̃_n) for calculating expressions of the form H·vec(X) along the way.

4. Apply the correction B ← (I + Δ)B.

This algorithm is similar in structure to the full version (10) of the multiplicative updates algorithm above. However, it is based on an iterative solution of (6) (as opposed to (1)), which conveniently lends itself to the use of a conjugate gradient algorithm in each Newton iteration, by exploiting the sparsity of H.

¹ Conjugate gradient requires -H to be PD. Although Bdiag(Q̃_1, ..., Q̃_N) is PD, -H is generally not PD, since P is not PD. We may thus use the conjugate-gradient-squared algorithm.
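Putting the pieces together, one Newton iteration of the proposed scheme might be sketched as follows (ours, not the authors' code); it reuses hessian_matvec from the previous sketch and uses SciPy's conjugate-gradient-squared solver cgs in the spirit of the footnote, with an arbitrary iteration cap.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cgs

def ncg_iteration(B, Q):
    """One Newton step (steps 1-4 above) for the HEAD gradient equations (6)."""
    N = B.shape[0]
    Qt = np.array([B @ Qn @ B.T for Qn in Q])          # step 1: transformed targets
    # step 2: gradient deviation at B = I; column m of G is e_m - Qt_m e_m
    G = np.eye(N) - np.stack([Qt[m, :, m] for m in range(N)], axis=1)
    eps = G.reshape(-1, order='F')
    # step 3: solve H vec(X) = -eps (with X = Delta^T) via conjugate-gradient-squared,
    # using only the sparse action (15) with the transformed targets
    H = LinearOperator((N * N, N * N), matvec=lambda x: hessian_matvec(Qt, x),
                       dtype=float)
    x, _ = cgs(H, -eps, maxiter=200)
    Delta = x.reshape(N, N, order='F').T
    return (np.eye(N) + Delta) @ B                     # step 4: apply the correction
```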

Figure 1: rms error for MU, IR and NCG with AJD matrices, N = 3, 9 (left), and for IR and NCG with arbitrary matrices, N = 5, 20 (right); log rms error vs. number of iterations.

Fig. 1 shows typical convergence patterns. We distinguish between two cases: one in which all Q_n are indeed approximately jointly diagonalizable (termed "AJD matrices" in the figure) and one in which they are arbitrary (symmetric PD) matrices (termed "arbitrary matrices"). The reason for the distinction is that the MU algorithm (with (11)) inherently assumes that the matrices are nearly jointly diagonalizable. Recall, however, that this is not a necessary condition for the existence of a solution to the HEAD problem - and we therefore also present the case of arbitrary target-matrices; in these cases the MU algorithm does not converge, and is therefore omitted from the plots.

The figure shows the evolution of the log of the residual root-mean-square (rms) HEAD diagonalization error vs. iteration number. In the left-hand side plots (for AJD matrices) we show all three algorithms, which were all similarly initialized with the (exact) joint diagonalizer of Q_1 and Q_2. In the right-hand side plots (for arbitrary matrices) we show the IR algorithm (initialized with B = I) and the NCG algorithm, initialized with the output of the IR algorithm obtained as soon as the log rms error falls below -2.5. It is evident that the transition to the NCG algorithm, with its quadratic convergence, significantly accelerates the convergence. This is obtained at the cost of only a moderate increase in the computational load per iteration: IR requires O(N^4) multiplications per iteration, whereas NCG requires approximately O(N^5).

4. REFERENCES

[1] J.-F. Cardoso and A. Souloumiac, "Jacobi angles for simultaneous diagonalization," SIAM J. on Matrix Anal. and Appl., 17:161-164, 1996.

[2] S. Degerine and A. Zaidi, "Separation of an Instantaneous Mixture of Gaussian Autoregressive Sources by the Exact Maximum Likelihood Approach," IEEE Trans. Signal Processing, 52(6):1492-1512, 2004.

[3] D.-T. Pham and Ph. Garat, "Blind Separation of Mixtures of Independent Sources Through a Quasi Maximum Likelihood Approach," IEEE Trans. Signal Processing, 45(7):1712-1725, 1997.

[4] P. Tichavsky and A. Yeredor, "Fast Approximate Joint Diagonalization Incorporating Weight Matrices," IEEE Trans. Signal Processing, 57(3):878-891, 2009.

[5] A. Yeredor, "Blind separation of Gaussian sources via second-order statistics with asymptotically optimal weighting," IEEE Signal Processing Letters, 7(7):197-200, 2000.
