
2009 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing

ON HYBRID EXACT-APPROXIMATE JOINT DIAGONALIZATION

Arie Yeredor

School of Electrical Engineering, Tel-Aviv University, P.O. Box 39040, Tel-Aviv 69978, Israel

[email protected]

ABSTRACT

We consider a particular form of the classical approximate joint diagonalization problem, often encountered in Maximum Likelihood source separation based on second-order statistics with Gaussian sources. In this form the number of target-matrices equals their dimension, and the joint diagonality criterion requires that in each transformed ("diagonalized") target-matrix, all off-diagonal elements on one specific row and column be exactly zero, but does not care about the other (diagonal or off-diagonal) elements. We show that this problem always has a solution for symmetric, positive-definite target-matrices and present some interesting alternative formulations. We review two existing iterative approaches for obtaining the diagonalizing matrices and propose a third one with faster convergence.

1. INTRODUCTION

The general framework of Approximate Joint Diagonalization (AJD), which is closely related to the problem of Blind Source Separation (BSS), considers a set of K (typically more than two) square, symmetric, real-valued N × N matrices denoted Q_1, ..., Q_K, often termed the "target matrices". The goal is to find a single matrix B (or its inverse A, see below) which best "jointly diagonalizes" the target matrices in some sense. In the BSS context, the diagonalizing B serves as an estimate of the demixing matrix, which is subsequently used for recovering the sources from their observed mixtures. A serves, in turn, as an estimate of the unknown mixing matrix.

Quite a few approaches to the AJD problem have been suggested in the past two decades, mainly differing in the proposed criteria for measuring the extent of attained joint diagonalization. These can be generally divided into "direct" criteria, looking for B which makes all B Q_k B^T "as diagonal as possible"; and "indirect" criteria, looking for A (and K diagonal matrices D_k) such that all Q_k are "best fitted" with A D_k A^T. When B (or A) is restricted (such as in [1]) to be orthonormal, the problem is commonly referred to as orthogonal AJD; otherwise it is non-orthogonal AJD.


In the context of Maximum Likelihood (ML) or Quasi-ML (QML) [3] source separation based on Second Order Statistics (SOS) for Gaussian sources, the likelihood equations (often also called "estimating equations" in this context) take an interesting form, which can be interpreted as a very particular case of (non-orthogonal) AJD. The equations can be viewed as requiring the optimization of a hybrid exact-approximate AJD criterion, satisfied both in the "direct" and in the "indirect" formulation. We shall refer to the problem of solving these equations as the "Hybrid Exact-Approximate Diagonalization" (HEAD) problem.

General AJD is basically an ad-hoc tool which attempts to "best fit" a prescribed model to the set of target matrices. HEAD, however, enjoys (with some particular choices of target-matrices) the property of asymptotic optimality of the resulting estimate (of either the demixing or mixing matrix) in the relevant BSS context. Indeed, as shown in [5, 4], general AJD can be made asymptotically optimal as well, by use of proper weighting. However, the same asymptotic optimality appears in a much more "natural" and computationally simpler way in HEAD, since, as mentioned, HEAD can directly attain the ML estimate of A or B in such cases.

In Section 2 we formulate and characterize the HEAD problem and obtain conditions for existence of a solution. In Section 3 we briefly review two existing solution approaches and propose a third one, with demonstrated improved convergence and computational advantages.

2. HYBRID EXACT-APPROXIMATE DIAGONALIZATION (HEAD)

While in classical AJD the number K of target-matrices is arbitrary (the only restriction is usually K > 2), one of the main features of HEAD is that K must equal the matrices' dimension, namely K = N. Indeed, consider N target-matrices Q_1, ..., Q_N, each of dimensions N × N. The HEAD problem can be stated as:

P1: Given N target-matrices Q_1, ..., Q_N, find an N × N matrix B = [b_1 b_2 ... b_N]^T, such that

$$\mathbf{b}_m^T\mathbf{Q}_n\mathbf{b}_n = \delta_{mn}, \qquad \forall\, m,n \in \{1,\dots,N\}, \tag{1}$$

where δ_mn denotes Kronecker's delta (which is 1 if m = n and 0 otherwise). Equivalently, the same problem can be stated as:

P2: Given N target-matrices Q_1, ..., Q_N, find an N × N matrix B, such that

$$\mathbf{B}\,\mathbf{Q}_n\mathbf{B}^T\mathbf{e}_n = \mathbf{e}_n, \qquad n = 1,\dots,N, \tag{2}$$

where e_n denotes the n-th column of the N × N identity matrix I.

In other words, each transformed matrix B Q_n B^T should be exactly "diagonal" in its n-th column (and, since it is symmetric, also in its n-th row), in the sense that all off-diagonal elements in this row and column must be exactly zero. All other elements may take arbitrary (nonzero) values. In addition, with the problem formulations above we also require that the diagonal (n,n)-th element of B Q_n B^T be 1 - but this is merely a scaling constraint on the rows of B: once any matrix B satisfying the exact off-diagonality criteria is found, it is straightforward to rescale each of its rows such that b_n^T Q_n b_n = 1, without any effect on the off-diagonality property.

Multiplying both sides of (2) by A = B^{-1} on the left we obtain

$$\mathbf{Q}_n\mathbf{B}^T\mathbf{e}_n = \mathbf{A}\mathbf{e}_n \;\Longrightarrow\; \mathbf{a}_n = \mathbf{Q}_n\mathbf{b}_n, \qquad n = 1,\dots,N, \tag{3}$$

where a_n denotes the n-th column of A = [a_1 ... a_N]. In other words, the same problem can be stated as follows:

P3: Given N target-matrices Q_1, ..., Q_N, find two reciprocal N × N matrices B and A = B^{-1}, such that the n-th column of A is given by Q_n b_n, with b_n^T being the n-th row of B, n = 1, ..., N.

Assuming that all target-matrices are invertible, we may now swap the roles of B and A, obtaining that the n-th column b_n of B^T should be given by Q_n^{-1} a_n, where a_n denotes the n-th row of A^T, n = 1, ..., N. This means that the same problem formulations P1 and P2 above may be cast in terms of A^T (instead of B), with the inverses of the target-matrices substituting the target-matrices. This implies that the "direct" and "indirect" formulations of HEAD coincide: if B is the HEAD diagonalizer of Q_1, ..., Q_N, then its transposed inverse A^T is the HEAD diagonalizer of the inverse set Q_1^{-1}, ..., Q_N^{-1}. It is important to note that this desirable "self-reciprocity" property is generally not shared by other non-orthogonal AJD algorithms. In fact, it is easy to show that this property is satisfied in non-orthogonal AJD when (and only when) the target matrices are exactly jointly diagonalizable. In general, however, the target matrices are not exactly jointly diagonalizable, since they are merely estimates of a set of such matrices. Nevertheless, the HEAD solution always enjoys the "self-reciprocity", reflecting some kind of "self-consistency".
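As a concrete illustration (ours, not part of the original paper), the following minimal NumPy sketch evaluates the residuals of a candidate B, both in the element-wise form (1) and in the equivalent P3 form a_n = Q_n b_n; the target matrices are assumed to be supplied as a length-N sequence Q of symmetric N × N arrays, and the function name is our own.

```python
import numpy as np

def head_residuals(B, Q):
    """Residuals of the HEAD equations for a candidate diagonalizer B.

    R1[m, n] = b_m^T Q_n b_n - delta_mn   (element-wise form (1));
    R3[:, n] = Q_n b_n - a_n              (equivalent P3 form, A = B^{-1}).
    """
    N = B.shape[0]
    A = np.linalg.inv(B)
    R1 = np.empty((N, N))
    R3 = np.empty((N, N))
    for n in range(N):
        Qb = Q[n] @ B[n]            # Q_n b_n  (B[n] is the row b_n^T)
        R1[:, n] = B @ Qb           # m-th entry: b_m^T Q_n b_n
        R3[:, n] = Qb - A[:, n]     # vanishes at a HEAD solution
    R1 -= np.eye(N)
    return R1, R3
```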

As is obvious, e.g., from P1, HEAD requires the solution of N² equations in N² unknowns (the elements of B). Since these equations are nonlinear, a (real-valued) solution may or may not exist in general, and may or may not be unique. We shall show, however, that if all the N target-matrices are positive-definite (PD), a solution must exist (but we do not have an explicit condition for uniqueness).

Indeed, let Q_1, ..., Q_N denote a set of (symmetric) PD target matrices, and let λ_n denote the smallest eigenvalue of Q_n, n = 1, ..., N. Consider the function

$$C(\mathbf{B}) = \log\left|\det\mathbf{B}\right| - \frac{1}{2}\sum_{n=1}^{N}\mathbf{e}_n^T\mathbf{B}\mathbf{Q}_n\mathbf{B}^T\mathbf{e}_n. \tag{4}$$

For all nonsingular B, C(B) is continuous, differentiable and bounded from above:

$$
\begin{aligned}
C(\mathbf{B}) &= \log\left|\det\mathbf{B}\right| - \frac{1}{2}\sum_{n=1}^{N}\mathbf{b}_n^T\mathbf{Q}_n\mathbf{b}_n \\
&\le \log\prod_{n=1}^{N}\|\mathbf{b}_n\| - \frac{1}{2}\sum_{n=1}^{N}\lambda_n\mathbf{b}_n^T\mathbf{b}_n \\
&= \frac{1}{2}\sum_{n=1}^{N}\left\{\log\|\mathbf{b}_n\|^{2} - \lambda_n\|\mathbf{b}_n\|^{2}\right\} \\
&\le \frac{1}{2}\sum_{n=1}^{N}\left\{-\log\lambda_n - 1\right\},
\end{aligned} \tag{5}
$$

where ‖b_n‖² ≜ b_n^T b_n denotes the squared norm of b_n, and where we have used the properties:

1. |det B| ≤ ∏_{n=1}^{N} ‖b_n‖ (Hadamard's inequality);

2. b_n^T Q_n b_n ≥ λ_n ‖b_n‖²; and

3. log x - λx ≤ -log λ - 1 for all x > 0.

Consequently, noting also that C(αB) → -∞ (for any fixed B) as α → 0 or as α → ∞, C(B) attains a maximum for some nonsingular B, and its derivative with respect to B at the maximum point vanishes. Indeed, differentiating C(B) with respect to B_(n,m) (the (n,m)-th element of B) and equating to zero we get (for all m, n ∈ {1, ..., N})

$$
\begin{aligned}
\frac{\partial C(\mathbf{B})}{\partial B_{(n,m)}}
&= A_{(m,n)} - \frac{1}{2}\sum_{k=1}^{N} 2\,\mathbf{e}_k^T\mathbf{E}_{nm}\mathbf{Q}_k\mathbf{B}^T\mathbf{e}_k \\
&= A_{(m,n)} - \sum_{k=1}^{N}\delta_{kn}\,\mathbf{e}_m^T\mathbf{Q}_k\mathbf{B}^T\mathbf{e}_k \\
&= A_{(m,n)} - \mathbf{e}_m^T\mathbf{Q}_n\mathbf{b}_n = 0,
\end{aligned} \tag{6}
$$

where E_nm ≜ e_n e_m^T denotes an all-zeros matrix with only a 1 at the (n,m)-th location. By concatenating these equations for m = 1, ..., N into a vector we get a_n = Q_n b_n, which has to be satisfied for each n = 1, ..., N - as required in formulation P3 above. This means that the solution of HEAD can be expressed as the maximizer of C(B), which, as mentioned above, always exists when the target matrices are all PD.

Naturally, this derivation is closely related to the fact that HEAD yields the ML (or QML) estimate of the demixing matrix in some specific BSS contexts (e.g., [3, 2]) with some specific target-matrices. However, we obtained here a more general result, which holds for any set of PD target matrices, and not only for the specific matrices used for ML or QML estimation in [3, 2].
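For later reference, here is a similar minimal sketch (ours, under the same assumptions on Q) of the cost (4) and its gradient (6); since the gradient vanishes exactly at a HEAD solution, it can also serve as a convergence check for the algorithms of the next section.

```python
import numpy as np

def head_cost_and_gradient(B, Q):
    """Cost C(B) of (4) and its gradient (6); the gradient vanishes at a HEAD solution."""
    N = B.shape[0]
    A = np.linalg.inv(B)
    C = np.linalg.slogdet(B)[1]            # log|det B|
    G = np.empty((N, N))                   # G[n, m] = dC/dB(n, m)
    for n in range(N):
        Qb = Q[n] @ B[n]                   # Q_n b_n
        C -= 0.5 * B[n] @ Qb
        G[n, :] = A[:, n] - Qb             # A(m, n) - e_m^T Q_n b_n
    return C, G
```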

3. SOLUTION OF HEAD

Unlike classical AJD, the HEAD problem and its solutions have rarely been addressed in the literature. To the best of our knowledge, so far only two different iterative algorithms have been proposed (both in the context of ML or QML BSS): one by Pham and Garat [3], which is based on multiplicative updates of B, and the other by Degerine and Zaidi [2], which is based on alternating oblique projections with respect to the columns of B.

3.1. Multiplicative Updates (MU)

The basic idea in this approach is to iteratively update B using multiplicative updates of the form

$$\mathbf{B} \leftarrow (\mathbf{I} - \mathbf{P}^T)\,\mathbf{B}, \tag{7}$$

where P is a "small" update matrix. Let us denote

$$\tilde{\mathbf{Q}}_n \triangleq \mathbf{B}\mathbf{Q}_n\mathbf{B}^T, \qquad n = 1,\dots,N. \tag{8}$$

The implied update of the HEAD equations (1) in terms of P can now be expressed as

$$\mathbf{e}_n^T(\mathbf{I} - \mathbf{P}^T)\,\tilde{\mathbf{Q}}_n\,(\mathbf{I} - \mathbf{P})\,\mathbf{e}_m = \delta_{mn}, \tag{9}$$

which, when neglecting terms which are quadratic in the elements of P and ignoring the scaling equations (those with m = n), can also be written as

$$\mathbf{e}_n^T\tilde{\mathbf{Q}}_n\mathbf{p}_m + \mathbf{e}_m^T\tilde{\mathbf{Q}}_n\mathbf{p}_n = \tilde{Q}_n(n,m), \qquad n \neq m, \tag{10}$$

where p_i denotes the i-th column of P, and where Q̃_n(n,m) denotes the (n,m)-th element of Q̃_n. This yields N(N-1) linear equations for the N(N-1) off-diagonal elements of P, whose diagonal terms are arbitrarily set to zero.

Further simplification may be obtained if we assume that all Q̃_n are already "nearly diagonal", so that the off-diagonal elements in each row are negligible with respect to the diagonal element in that row. Under this assumption, the linear equations (10) can be further approximated as

$$\tilde{Q}_n(n,n)\,P(n,m) + \tilde{Q}_n(m,m)\,P(m,n) = \tilde{Q}_n(n,m), \qquad n \neq m. \tag{11}$$

Rewriting this equation with the roles of m and n exchanged, we obtain two equations for the two unknowns P(n,m) and P(m,n), thus decoupling the original set of dimensions N(N-1) × N(N-1) into N(N-1)/2 sets of dimensions 2 × 2 each. The procedure is repeated iteratively until convergence.
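For illustration only (this is our sketch, not the implementation of [3]), one MU sweep based on the decoupled 2 × 2 systems of (11) might look as follows; the final row rescaling enforces the scaling (m = n) equations of (1), and Q is again assumed to be a length-N sequence of symmetric PD arrays.

```python
import numpy as np

def mu_sweep(B, Q):
    """One multiplicative-update sweep based on the approximated equations (11)."""
    N = B.shape[0]
    Qt = np.array([B @ Qn @ B.T for Qn in Q])   # transformed targets (8)
    P = np.zeros((N, N))                        # diagonal of P stays zero
    for n in range(N):
        for m in range(n + 1, N):
            # 2x2 system in P(n,m), P(m,n): (11) and (11) with m and n exchanged
            M = np.array([[Qt[n, n, n], Qt[n, m, m]],
                          [Qt[m, n, n], Qt[m, m, m]]])
            rhs = np.array([Qt[n, n, m], Qt[m, m, n]])
            P[n, m], P[m, n] = np.linalg.solve(M, rhs)
    B = (np.eye(N) - P.T) @ B                   # multiplicative update (7)
    for n in range(N):                          # rescale rows: b_n^T Q_n b_n = 1
        B[n] /= np.sqrt(B[n] @ Q[n] @ B[n])
    return B
```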

3.2. Iterative Relaxation (IR) - Alternating Directions

Consider a specific value of n, and define a scalar product ⟨u, v⟩_{Q_n} ≜ u^T Q_n v. Evidently, (1) implies that (with respect to this scalar product) b_n has unit norm and is orthogonal to all other rows of B. Thus, assuming all other rows of B are fixed, b_n is updated by (oblique) projection onto the subspace orthogonal to all other rows, followed by scale normalization, so that the updated b_n satisfies

$$\tilde{\mathbf{B}}^{(n)}\mathbf{Q}_n\mathbf{b}_n = \mathbf{0}, \qquad \mathbf{b}_n^T\mathbf{Q}_n\mathbf{b}_n = 1, \tag{12}$$

where B̃^(n) denotes the matrix B without its n-th row (an (N-1) × N matrix). This procedure (applied sequentially for n = 1, ..., N) is repeated iteratively until convergence. Note that due to the "self-reciprocity" property of HEAD, an equivalent solution can be obtained by applying the same procedure to the columns of A, with the matrices Q_n replaced by their inverses.
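A minimal sketch of one IR sweep (ours, not the implementation of [2]): for each n, a vector v spanning the null space of B̃^(n) is taken from an SVD, b_n is set proportional to Q_n^{-1} v (so that Q_n b_n is orthogonal to all other rows), and the result is scaled so that b_n^T Q_n b_n = 1.

```python
import numpy as np

def ir_sweep(B, Q):
    """One Iterative-Relaxation sweep: update each row of B in turn, per (12)."""
    N = B.shape[0]
    B = B.copy()
    for n in range(N):
        B_minus = np.delete(B, n, axis=0)        # B without its n-th row
        v = np.linalg.svd(B_minus)[2][-1]        # spans the null space of B_minus
        bn = np.linalg.solve(Q[n], v)            # b_n proportional to Q_n^{-1} v
        B[n] = bn / np.sqrt(v @ bn)              # enforce b_n^T Q_n b_n = 1
    return B
```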

3.3. The proposed approach - Newton's method with conjugate gradient (NCG)

We propose to apply Newton's method to the nonlinear equations (6). To this end, let us define the N² × N² Hessian matrix H as follows. First, we define the indexing function ix(m,n) ≜ (m-1)N + n, which determines the location of B_(m,n) in vec(B^T) (the concatenation of the columns of B^T into an N² × 1 vector). Then, differentiating (6) with respect to B_(p,q) we get (for all m, n, p, q ∈ {1, ..., N})

$$
\begin{aligned}
H\big(\mathrm{ix}(m,n),\,\mathrm{ix}(p,q)\big)
&= \frac{\partial^2 C(\mathbf{B})}{\partial B_{(m,n)}\,\partial B_{(p,q)}}
= \frac{\partial}{\partial B_{(p,q)}}\left\{A_{(n,m)} - \mathbf{e}_n^T\mathbf{Q}_m\mathbf{b}_m\right\} \\
&= -\mathbf{e}_n^T\mathbf{A}\mathbf{E}_{pq}\mathbf{A}\mathbf{e}_m - \mathbf{e}_n^T\mathbf{Q}_m\mathbf{e}_q\cdot\delta_{mp} \\
&= -A_{(n,p)}A_{(q,m)} - Q_m(n,q)\cdot\delta_{mp},
\end{aligned} \tag{13}
$$

(where we have used ∂A = -A · ∂B · A). The key observation here is that if we differentiate at B = I, then H becomes considerably sparse, since at B = I we have A_(n,p)A_(q,m) = δ_np δ_qm. The solution of the associated system of N² × N² equations (for the update of all N² elements of B) can then be attained with relative computational simplicity using the conjugate gradient method (which exploits this sparsity). Note, indeed, that with B = I we have

$$\mathbf{H} = -\big(\mathrm{Bdiag}(\mathbf{Q}_1,\dots,\mathbf{Q}_N) + \mathbf{P}\big), \tag{14}$$

where the Bdiag(·) operator creates a block-diagonal matrix from its matrix arguments, and where P is merely a permutation matrix transforming the vec(·) of a matrix into the vec(·) of its transpose, namely for any N × N matrix X we have P vec(X) = vec(X^T) (note also that P = P^T = P^{-1}). Therefore, the operation of H on X can be easily expressed as:

$$\mathbf{H}\,\mathrm{vec}(\mathbf{X}) = -\begin{bmatrix}\mathbf{Q}_1\mathbf{x}_1\\ \vdots\\ \mathbf{Q}_N\mathbf{x}_N\end{bmatrix} - \mathrm{vec}(\mathbf{X}^T), \tag{15}$$

where x_1, ..., x_N denote the columns of X.
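Since H never has to be formed explicitly, the conjugate-gradient solver only needs the matrix-vector product (15). A minimal NumPy sketch of this product (ours; the function name is our own, and the argument Qt holds whichever matrices currently play the role of Q_1, ..., Q_N):

```python
import numpy as np

def hessian_matvec(Qt, x):
    """Action (15) of the Hessian (14) at B = I on a vector x = vec(X)."""
    N = Qt.shape[0]
    X = x.reshape(N, N, order='F')      # recover X from its column-stacking
    HX = np.empty_like(X)
    for m in range(N):
        HX[:, m] = -Qt[m] @ X[:, m]     # block-diagonal part: -Q_m x_m
    HX -= X.T                           # permutation part: -vec(X^T)
    return HX.reshape(-1, order='F')
```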

Luckily, the structure of the problem enables us to always work in the vicinity of B = I, as each update of B can be translated into a transformation of the target matrices, defining a "new" problem in terms of the transformed matrices. Then, B = I can again be conveniently used as an "initial guess" for the new problem. Summarizing our algorithm, given the target matrices and some initial guess of B, we repeat the following until convergence:

1. Update the transformed target matrices Q̃_n = B Q_n B^T, n = 1, ..., N (as in (8)).

2. Construct the deviation ε of the updated gradient equations (6) at B = I from zero, i.e., ε(ix(n,m)) = δ_mn - e_m^T Q̃_n e_n.

3. Find the correction matrix Δ by solving H vec(Δ^T) = -ε. Use the conjugate gradient method¹, exploiting the sparsity of H by the use of (15) (with each Q_n substituted by Q̃_n) for calculating expressions of the form H·vec(X) along the way.

4. Apply the correction B ← (I + Δ)B.

This algorithm is similar in structure to the full version (10) of the multiplicative updates algorithm above. However, it is based on an iterative solution of (6) (as opposed to (1)), which conveniently lends itself to the use of a conjugate gradient algorithm in each Newton iteration, by exploiting the sparsity of H.

¹ Conjugate gradient requires -H to be PD. Although Bdiag(Q̃_1, ..., Q̃_N) is PD, -H is generally not PD, since P is not PD. We may thus use the conjugate-gradient-squared algorithm.
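Putting the pieces together, one Newton iteration of the proposed scheme might be sketched as follows (ours, not the authors' code); it reuses hessian_matvec from the previous sketch and uses SciPy's conjugate-gradient-squared solver cgs in the spirit of the footnote, with an arbitrary iteration cap.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cgs

def ncg_iteration(B, Q):
    """One Newton step (steps 1-4 above) for the HEAD gradient equations (6)."""
    N = B.shape[0]
    Qt = np.array([B @ Qn @ B.T for Qn in Q])          # step 1: transformed targets
    # step 2: gradient deviation at B = I; column m of G is e_m - Qt_m e_m
    G = np.eye(N) - np.stack([Qt[m, :, m] for m in range(N)], axis=1)
    eps = G.reshape(-1, order='F')
    # step 3: solve H vec(X) = -eps (with X = Delta^T) via conjugate-gradient-squared,
    # using only the sparse action (15) with the transformed targets
    H = LinearOperator((N * N, N * N), matvec=lambda x: hessian_matvec(Qt, x),
                       dtype=float)
    x, _ = cgs(H, -eps, maxiter=200)
    Delta = x.reshape(N, N, order='F').T
    return (np.eye(N) + Delta) @ B                     # step 4: apply the correction
```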

Figure 1: rms error for MU, IR and NCG with AJD matrices, N = 3, 9 (left), and for IR and NCG with arbitrary matrices, N = 5, 20 (right); log rms error vs. number of iterations.

Fig. 1 shows typical convergence patterns. We distinguish between two cases: one in which all Q_n are indeed approximately jointly diagonalizable (termed "AJD matrices" in the figure) and one in which they are arbitrary (symmetric PD) matrices (termed "arbitrary matrices"). The reason for the distinction is that the MU algorithm (with (11)) inherently assumes that the matrices are nearly jointly diagonalizable. Recall, however, that this is not a necessary condition for the existence of a solution to the HEAD problem - and we therefore also present the case of arbitrary target-matrices; in these cases the MU algorithm does not converge, and is therefore omitted from the plots.

The figure shows the evolution of the log of the residual root-mean-square (rms) HEAD diagonalization error vs. iteration number. In the left-hand side plots (for AJD matrices) we show all three algorithms, which were all similarly initialized with the (exact) joint diagonalizer of Q_1 and Q_2. In the right-hand side plots (for arbitrary matrices) we show the IR algorithm (initialized with B = I) and the NCG algorithm, initialized with the output of the IR algorithm obtained as soon as the log rms error falls below -2.5. It is evident that the transition to the NCG algorithm, with its quadratic convergence, significantly accelerates the convergence. This is obtained at the cost of only a moderate increase in the computational load per iteration: IR requires O(N^4) multiplications per iteration, whereas NCG requires approximately O(N^5).

4. REFERENCES

[1] J.-F. Cardoso and A. Souloumiac, "Jacobi angles for simultaneous diagonalization," SIAM J. on Matrix Anal. and Appl., 17:161-164, 1996.

[2] S. Degerine and A. Zaidi, "Separation of an Instantaneous Mixture of Gaussian Autoregressive Sources by the Exact Maximum Likelihood Approach," IEEE Trans. Signal Processing, 52(6):1492-1512, 2004.

[3] D.-T. Pham and Ph. Garat, "Blind Separation of Mixtures of Independent Sources Through a Quasi Maximum Likelihood Approach," IEEE Trans. Signal Processing, 45(7):1712-1725, 1997.

[4] P. Tichavsky and A. Yeredor, "Fast Approximate Joint Diagonalization Incorporating Weight Matrices," IEEE Trans. Signal Processing, 57(3):878-891, 2009.

[5] A. Yeredor, "Blind separation of Gaussian sources via second-order statistics with asymptotically optimal weighting," IEEE Signal Processing Letters, 7(7):197-200, 2000.
