
Neural Networks
Lesson 8 - Blind Source Separation

Prof. Michele Scarpiniti

INFOCOM Dpt. - “Sapienza” University of Rome

http://ispac.ing.uniroma1.it/scarpiniti/index.htm

[email protected]

Rome, 19 November 2009


1 Introduction
  Basic Concepts
  The InfoMax algorithm

2 Other BSS techniques
  Nonlinear BSS
  Complex-valued BSS


Introduction


General mixing model

The general mixing model is described by the following relation:

x [n] = F {s [n]} = F {s1 [n] , ..., sN [n]}

where:

s[n] is the vector of the N original and independent sources;

F is a mixing operator;

x[n] is the vector of the observed signals, called mixtures.


General de-mixing model

The general de-mixing model is described by the following relation:

u [n] = G {x[n]} = G ◦ F {s [n]}

where:

G is the recovering or de-mixing operator ;

u[n] is the source estimate.

One can exploit only the statistical independence of the sources ⇒ Independent Component Analysis (ICA).


Ambiguity of the solution

If F has no particular structure, separation may not be possible: in particular, there can be infinitely many solutions G.

For this reason we consider only linear and instantaneous models F, or Post Nonlinear (PNL) ones.

Recalling the trivial ambiguities of ICA algorithms, we have:

u [n] = PΛs [n]

where:

P is a permutation matrix;
Λ is a diagonal or scaling matrix.

In the nonlinear case an offset ambiguity is also present.


History of ICA

Classic ICA: linear mixtures

1 ICA was born in 1986 (Hérault): neuromimetic approach
2 Organic development from 1994 (Comon, Jutten, Cardoso, etc.)
  a Statistical approach (HOS)
    - cumulant-based methods (Amari et al.)
    - kurtosis maximization (Amari et al., Jutten)
  b Information-theoretic approach
    - FastICA (Hyvärinen et al.)
    - InfoMax (Bell & Sejnowski)
      1 Maximization of the Joint Entropy (ME)
      2 Minimization of the Mutual Information (MMI)
  c Different heuristics
    - geometric approach (Babaie-Zadeh)

Nonlinear ICA

1 Organic development from 1999 (Jutten & Taleb)
2 Still an open problem


Different ICA approaches


Linear BSS

The simplest problem formulation is in the linear and instantaneous environment: we observe N random variables x1, . . . , xN, which are modeled as linear combinations of N random variables s1, . . . , sN:

x = As (1)

where aij are the entries of the mixing matrix A.

The aim of BSS algorithms is to identify a de-mixing matrix W, such that the components of the output vector

u = Wx (2)

are as statistically independent as possible.
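As a toy illustration of (1)-(2), here is a minimal NumPy sketch; the sources, the matrix A and all numerical values are invented for the example:

import numpy as np

rng = np.random.default_rng(0)

# Two independent (non-Gaussian) toy sources, T samples each
T = 10000
s = np.vstack([np.sign(rng.standard_normal(T)) * rng.exponential(1.0, T),  # super-Gaussian
               rng.uniform(-1.0, 1.0, T)])                                 # sub-Gaussian

# Unknown mixing matrix A and observed mixtures x = A s   (Eq. 1)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

# BSS looks for W such that u = W x has components as independent as possible (Eq. 2)
W = np.eye(2)          # initial guess, to be adapted by an ICA/InfoMax algorithm
u = W @ x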


Linear mixing model

Looking at the linear (and instantaneous) mixing model in more depth, we obtain the following system:

x1 = a11 s1 + a12 s2 + . . . + a1N sN
x2 = a21 s1 + a22 s2 + . . . + a2N sN
...
xN = aN1 s1 + aN2 s2 + . . . + aNN sN

We want a similar expression for the linear de-mixing model too.


Linear de-mixing model

The solution to this problem can be addressed by the well-known ICA algorithms, like FastICA, JADE, etc. Another well-performing algorithm is the InfoMax, proposed by Bell & Sejnowski in 1995.

InfoMax addresses the problem of maximizing the mutual information I(y; x) between the input vector x and an invertible nonlinear transform of it, y, obtained as

y = h(u) = h(Wx)

where W is the N × N de-mixing matrix and h(u) = [h1(u1), . . . , hN(uN)]T is a set of N nonlinear functions.

Note that the network used by the Bell & Sejnowski algorithm is a single-layer neural network. In this way the set of nonlinear functions are the activation functions of the neural network. For this reason the functions hi(ui) are usually called activation functions or AFs.


Linear BSS system

The whole system is shown in the following learning scheme:

A cost function measuring the statistical independence of the network output y is optimized; the network free parameters (matrix weights, or nonlinear function parameters) are then changed during the learning.


The InfoMax algorithm

We can now analyze how the InfoMax algorithm works. Because the mapping between input and output is deterministic, maximizing I(y, x) is the same as maximizing the joint entropy H(y). In fact the following relation holds:

I (y, x) = H (y)− H (y| x)

where H(y|x) is whatever entropy the output has that did not come from the input. When we do not know what is noise and what is signal in the input, the mapping between x and y is deterministic and H(y|x) takes its lowest possible value (it diverges to −∞). The above equation can be differentiated as follows with respect to a parameter w involved in the mapping from x to y:

∂I(y, x)/∂w = ∂H(y)/∂w    (3)

because H(y|x) does not depend on w. In this way InfoMax is equivalent to entropy maximization (ME).


The InfoMax algorithm

In order to derive the learning algorithm, let px(x) and py(y) be the probability density functions (pdf) of the network input and output respectively, which have to satisfy the relation:

py(y) = px(x) / |det J|

where |·| denotes the absolute value and J is the Jacobian matrix of the transformation: J = [∂yi/∂xj]ij.

Since the joint entropy of the network output is defined as H(y) = −E{ln py(y)}, we obtain:

H(y) = E{ln |det J|} + H(x).

Now we can note that ∂yi/∂xj = (∂yi/∂ui)(∂ui/∂xj) = h′i(ui) · wij, so we obtain

ln |det J| = ln |det W| + Σ_{i=1}^{N} ln |h′i|.

Hence, the expression of the joint entropy H(y) (ignoring the expected value operator E{·}, replaced by instantaneous values) is:

H(y) = H(x) + ln |det W| + Σ_{i=1}^{N} ln |h′i|.    (4)


The InfoMax algorithm

The maximization (or minimization) of a generic cost function L{Φ} with respect to a parameter Φ can be obtained by applying the stochastic gradient method at the (l + 1)-th iteration:

Φ(l + 1) = Φ(l) + ηΦ ∂L{Φ(l)}/∂Φ = Φ(l) + ηΦ ∆Φ(l)

where ηΦ is the learning rate.

Remembering that H(x) is not affected by the parameters that we are learning, it is possible to write the learning rule for the matrix W as follows:

∆W = ∂H(y)/∂W = W−T + Ψ xT    (5)

where W−T = (W−1)T, Ψ = [Ψ1, . . . , ΨN]T and Ψk = h″k(uk)/h′k(uk).


The InfoMax algorithm: choice of nonlinearity

The nonlinear transformations hi(ui) are necessary for bounding the entropy in a finite range. Indeed, when hi(ui) is bounded, c ≤ hi(ui) ≤ d, for any random variable ui the entropy of yi = hi(ui) has an upper bound:

H(yi ) ≤ ln(d − c)

Therefore, the joint entropy of the transformed output vector is upper bounded:

H(y) ≤ Σ_{i=1}^{N} H(yi) ≤ N ln(d − c)

In fact, the above inequality holds for any bounded transform, so the global maximum of the entropy H(y) exists. H(y) may also have many local maxima determined by the functions hi used to transform u. If we choose c = 0 and d = 1, then the global maximum is reached at H(y) = 0.


The InfoMax algorithm: relation to ICA

But what is the relationship between InfoMax and ICA?

For this purpose we can introduce the mutual information of the linear outputs u as the Kullback-Leibler divergence of the output distribution:

I(u) = E{ ln( pu(u) / Π_{i=1}^{N} pui(ui) ) }

where pu(u) is the joint pdf of the output vector u and pui(ui) are the marginal pdfs. Then:

H(y) = −∫ py(y) ln py(y) dy = −E{ln py(y)} =
     = −E{ ln( pu(u) / Π_{i=1}^{N} |h′i| ) } = −E{ln pu(u)} + E{ ln Π_{i=1}^{N} |h′i| } =
     = −E{ln pu(u)} + E{ ln Π_{i=1}^{N} pui(ui) } − E{ ln Π_{i=1}^{N} pui(ui) } + E{ ln Π_{i=1}^{N} |h′i| } =
     = −E{ ln( pu(u) / Π_{i=1}^{N} pui(ui) ) } + E{ Σ_{i=1}^{N} ln( |h′i| / pui(ui) ) } =
     = −I(u) + E{ Σ_{i=1}^{N} ln( |h′i| / pui(ui) ) }

and we obtain:

H(y) = −I(u) + E{ Σ_{i=1}^{N} ln( |h′i| / pui(ui) ) }


The InfoMax algorithm: ME and MMI approaches

H(y) = −I(u) + E{ Σ_{i=1}^{N} ln( |h′i| / pui(ui) ) }    (6)

Thus if |h′i| = pui(ui) (∀i), then maximizing the joint entropy H(y) is equivalent to minimizing the mutual information, that is the Kullback-Leibler divergence (which is a measure of the independence of the ui signals), and so the ICA problem is solved. In this way hi(ui) should be the cumulative density function (cdf) of the i-th estimated source. The use of an adaptive AF can successfully fulfill the matching of hi(ui) to the cdf of the i-th source.

Moreover, the use of a cdf-like function for the hi functions allows the joint entropy H(y) to have its global maximum at H(y) = 0 (because c = 0 and d = 1).

From (6) the InfoMax algorithm can be performed by two equivalent approaches: maximizing the joint entropy of the network output (ME approach) or minimizing the mutual information (MMI approach).


The flexible AF: spline neuron

In order to obtain an adaptive or flexible activation function one can use the spline neuron.

The spline neuron can be described by the following figure:

In the second block the terms Mi are the columns of the matrix M, which depends on the particular basis used.


The flexible AF: spline neuron

A matrix formulation of the output value yi can be described as:

yi = hi (u) = T ·M ·Qi

where

T = [u³ u² u 1],
Qi = [Qi  Qi+1  Qi+2  Qi+3]T

and

M = (1/2) [ −1  3 −3  1
             2 −5  4 −1
            −1  0  1  0
             0  2  0  0 ]

The matrix M determines the characteristics of the interpolating curve: CR-Spline or B-Spline.

In order to guarantee a monotonic characteristic of the overall function, the following constraint must be imposed: Q1 < Q2 < . . . < QN.
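A minimal sketch of how such a spline AF can be evaluated in code, assuming uniformly spaced control points and a simple, illustrative mapping of the input u to a span index and local abscissa (that mapping is an implementation choice, not specified in the slides):

import numpy as np

# Catmull-Rom basis matrix (the CR-Spline M shown above)
M_CR = 0.5 * np.array([[-1,  3, -3,  1],
                       [ 2, -5,  4, -1],
                       [-1,  0,  1,  0],
                       [ 0,  2,  0,  0]], dtype=float)

def spline_af(u, Q, delta=0.5):
    """Evaluate a spline activation function at abscissa u.

    Q is the (monotonically increasing) vector of control points; delta is the
    uniform spacing of the knots on the u axis. Returns y = T M Q_i plus the
    local quantities needed by the learning rules (span index i, local abscissa t).
    """
    n_spans = len(Q) - 3                       # number of usable 4-point spans
    z = u / delta + (n_spans - 1) / 2.0        # map u onto the knot grid
    i = int(np.clip(np.floor(z), 0, n_spans - 1))
    t = z - i                                  # local parameter in [0, 1)
    T = np.array([t**3, t**2, t, 1.0])
    y = T @ M_CR @ Q[i:i + 4]
    return y, i, t

# Example: control points sampled from a sigmoid-like (cdf-like) curve
Q = np.tanh(np.linspace(-3, 3, 23))
y, i, t = spline_af(0.7, Q)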


Different spline basis

1 B-Spline:

M = (1/6) [ −1  3 −3  1
             3 −6  3  0
            −3  0  3  0
             1  4  1  0 ]

2 Bezier Spline:

M = [ −1  3 −3  1
       3 −6  3  0
      −3  3  0  0
       1  0  0  0 ]

3 Catmull-Rom Spline:

M = (1/2) [ −1  3 −3  1
             2 −5  4 −1
            −1  0  1  0
             0  2  0  0 ]

4 Hermite Spline:

M = [  2 −2  1  1
      −3  3 −2 −1
       0  0  1  0
       1  0  0  0 ]


A new InfoMax algorithm

A new algorithm can be derived using the InfoMax principle: maximizing the joint entropy of the network output. Now we have two sets of free parameters in the network: the entries of the W matrix and the spline control points. We have to derive two learning rules.

1 We can derive a similar learning rule for the matrix weights:

∆W = ∂H(y)/∂W = W−T + Ψ xT

But now it is

Ψi = h″i(ui)/h′i(ui) = (1/∆) (T̈ M Qi)/(Ṫ M Qi)

where ∆ = Qi+1 − Qi is the distance between two adjacent control points, Ṫ = [3u² 2u 1 0] and T̈ = [6u 2 0 0].

2 For the learning rule of the spline control points, we have:

∆Qmi ∝ ∂H(y)/∂Qmi = ∂[ Σ_{i=1}^{N} ln h′i(ui) ]/∂Qmi = (1/h′i(ui)) ∂h′i(ui)/∂Qmi =
     = (∆/(Ṫ M Qi)) · ∂(Ṫ M Qi)/(∆ ∂Qmi) = (Ṫ Mm)/(Ṫ M Qi),   m = 0, . . . , 3

where Mm is the m-th column of the matrix M.
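A companion sketch for these two quantities, reusing M_CR and the outputs (i, t) of spline_af from the sketch above; the uniform spacing ∆ and the helper name are illustrative assumptions:

import numpy as np

def spline_psi_and_dQ(t, Q, i, M, delta=0.5):
    """Score term Psi_i = h''/h' and control-point gradient for one spline span.

    t is the local abscissa in [0,1) returned by spline_af, i the span index,
    Q the control-point vector, M the spline basis matrix and delta the knot spacing.
    """
    T1 = np.array([3 * t**2, 2 * t, 1.0, 0.0])   # dT/dt
    T2 = np.array([6 * t, 2.0, 0.0, 0.0])        # d^2T/dt^2
    Qi = Q[i:i + 4]
    psi = (T2 @ M @ Qi) / (delta * (T1 @ M @ Qi))   # Psi_i = h''_i / h'_i
    dQ = (T1 @ M) / (T1 @ M @ Qi)                   # gradient w.r.t. the 4 active control points
    return psi, dQ

# usage: psi_i, grad_Q = spline_psi_and_dQ(t, Q, i, M_CR); Q[i:i+4] += eta_Q * grad_Q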


The Natural Gradient

The stochastic gradient is a very simple approach, but if the data lie in a Riemannian space S = {ϑ ∈ C}, convergence can be very slow or one can obtain local solutions.

In order to avoid these problems, given a generic functional L(ϑ) defined in S, the concept of natural gradient was introduced:

∇̃L(ϑ) = G−1(ϑ) ∇L(ϑ)

where ∇L(ϑ) is the classical stochastic gradient, G−1(ϑ) is the inverse of the Riemann curvature tensor and ∇̃L(ϑ) is the natural gradient.

The Riemann curvature tensor expresses the curvature of Riemannian manifolds. In this way the natural gradient can evaluate a gradient considering the curvature of the space in which the data are defined. A classical example is the distance between two cities in the world: the "true" distance is an arc on the surface of the Earth and not a straight line between the two points.


The Natural Gradient

Amari has demonstrated that the inverse of the Riemann curvature tensor is very simple for the set of non-singular n × n matrices W, denoted by GL(n, R). We have:

∇̃L(W) = ∇L(W) WT W

The InfoMax algorithm can be reformulated in this new way:

∆W = η (W−T + Ψ xT) WT W = η (I + Ψ xT WT) W = η (I + Ψ uT) W

This algorithm is more efficient than the classic one, essentially for two reasons:

1 it avoids local minima;
2 it avoids the evaluation of the inverse of the W matrix.
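A minimal sketch of this natural-gradient InfoMax iteration for the linear case, using the logistic AF (for which the slides' Ψk = h″k/h′k reduces to 1 − 2h(uk)); the learning rate, iteration count and initialization are illustrative assumptions:

import numpy as np

def infomax_natural_gradient(x, n_iter=200, eta=0.01, seed=0):
    """Natural-gradient InfoMax, dW = eta * (I + psi(u) u^T) W.

    x: (N, T) array of zero-mean mixtures. Uses the logistic nonlinearity,
    for which psi(u) = 1 - 2*h(u) (suited to super-Gaussian sources).
    """
    N, T = x.shape
    rng = np.random.default_rng(seed)
    W = np.eye(N) + 0.01 * rng.standard_normal((N, N))
    I = np.eye(N)
    for _ in range(n_iter):
        u = W @ x                          # current source estimates
        y = 1.0 / (1.0 + np.exp(-u))       # logistic activation h(u)
        psi = 1.0 - 2.0 * y                # psi = h''/h' for the logistic AF
        dW = (I + (psi @ u.T) / T) @ W     # natural-gradient direction (batch average)
        W += eta * dW
    return W

# usage with the mixtures x from the earlier sketch:  W = infomax_natural_gradient(x); u = W @ x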


The Natural Gradient

In addition, four new Riemannian metrics have been introduced by Arcangeli et al., 2004 and Squartini et al., 2005, for improving the convergence speed.

In this way we can obtain 5 expressions for the natural gradient:

∇̃R L(W) = ∇L(W) WT W                    (7)
∇̃L L(W) = W WT ∇L(W)                    (8)
∇̃LR L(W) = (W WT) ∇L(W) (WT W)          (9)
∇̃RR L(W) = ∇L(W) (WT WT W W)            (10)
∇̃LL L(W) = (W W WT WT) ∇L(W)            (11)

defined as right natural gradient (7) (that is, the standard natural gradient), left natural gradient (8), right/left natural gradient (9), right/right natural gradient (10) and left/left natural gradient (11).


The Natural Gradient

Using the previous expressions, we can obtain the new formulations of the learning rules for the de-mixing matrix W: from the equation

∆W = W−T + Ψ xT

we obtain the five rules:

∆W = (I + Ψ uT) W                        (12)
∆W = W (I + WT Ψ xT)                     (13)
∆W = W WT (I + Ψ uT) W                   (14)
∆W = (I + Ψ uT) WT W W                   (15)
∆W = W W WT (I + WT Ψ xT)                (16)


The Natural Gradient

Not all these new gradient definitions work efficiently for the BSS problem: some of these equations do not satisfy the equivariance property (Cardoso & Laheld, 1996).

Definition

An estimator Â of A (Â = Â(X)) is equivariant if for every invertible matrix M it holds:

Â(MX) = M Â(X)

This implies that, defining C = WA, the natural gradient algorithm (12) depends only on C:

∆W · A = ∆C = (I + Ψ uT) W A = (I + Ψ uT) C


The Natural Gradient

The equivariance property is not satisfied by all algorithms (13)-(16).

If we consider that the mixing matrix A is unitary (A AT = AT A = I), then:

1 ∆W · A = W (I + Ψ xT) A
2 ∆W · A = W WT (I + Ψ uT) W A = W A AT WT (I + Ψ uT) C = C CT (I + Ψ uT) C
3 ∆W · A = (I + Ψ uT) WT W W A = (I + Ψ uT) WT AT A W C = (I + Ψ uT) CT C C
4 ∆W · A = W W WT (I + Ψ xT) A

It is possible to note that the left (13) and left/left (16) algorithms do not satisfy the equivariance property: we expect that they will not work efficiently for the BSS solution.


Convergence and universal nonlinearity

The InfoMax algorithm with the natural gradient converges if (Mathis et al., 2002)

E{Ψ′(u)} E{u²} + E{Ψ(u) uT} > 0    (17)

Mathis has demonstrated that there does not exist a "universal" nonlinear function which satisfies the previous equation for every input signal.

Using spline functions, the previous constraint is always satisfied.

The function Ψi ≜ p′i(x)/pi(x) ≡ h″i(u)/h′i(u) is defined as the score function. Because we use an adaptive function, at convergence the activation function must assume the profile of the cdf of the input signal, in order to minimize the last term of (6). In this way the functions Ψi tend to the score functions and it can be demonstrated that condition (17) always holds.


Under-determined BSS

If the number M of sensors is less than the number N of sources, the problem (known as under-determined or over-complete BSS) is more complicated. A useful approach in this case could be the use of a geometric algorithm.

This approach performs particularly well if the sources are sparse. Sparsity is a measure of the number of zero elements in the source signal: the greater the sparsity, the more super-Gaussian the signal. If the signals are sparse enough, the scatter plot of the mixtures reveals some "characteristic" directions. From these directions the mixing matrix A can be estimated, for 2 sensors and N sources, as:

A = [ cos α1  cos α2  · · ·  cos αN
      sin α1  sin α2  · · ·  sin αN ]


Under-determined BSS

In order to improve the sparsity of the mixtures it is possible to perform some time-frequency transformation on the time signal. A good choice is one of the following (an STFT usage sketch follows this list):

1 the Short-Time Fourier Transform (STFT): the function to be transformed is multiplied by a window function which is nonzero for only a short period of time. The Fourier transform (a one-dimensional function) of the resulting signal is taken as the window is slid along the time axis, resulting in a two-dimensional representation of the signal, written as:

STFT{x(t)} ≡ X(ω, τ) = ∫_{−∞}^{+∞} x(t) w(t − τ) e^{−jωt} dt

2 some Wavelet Transform. A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. A wavelet transform is used to divide a time signal into wavelets. It possesses the ability to construct a time-frequency representation of a signal that offers very good time and frequency localization. The wavelet transform at scale a > 0 and translation value b ∈ R is

Xω(a, b) = (1/√a) ∫_{−∞}^{+∞} x(t) Ψ*((t − b)/a) dt

where Ψ*(t) is a continuous function in both the time domain and the frequency domain, called the mother wavelet.
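As an illustration of the STFT preprocessing, a short sketch using scipy.signal.stft; the toy signals and parameter values are invented for the example:

import numpy as np
from scipy.signal import stft

fs = 8000                                   # sampling rate (illustrative)
t = np.arange(2 * fs) / fs
x1 = np.sin(2 * np.pi * 440 * t)            # one toy mixture channel
x2 = np.sign(np.sin(2 * np.pi * 3 * t))     # another toy mixture channel

# STFT of each mixture: Xk has shape (n_freqs, n_frames)
f, tau, X1 = stft(x1, fs=fs, nperseg=512)
f, tau, X2 = stft(x2, fs=fs, nperseg=512)

# In the time-frequency domain the mixtures are much sparser, so the scatter
# plot of the coefficients exhibits the characteristic directions from which
# the columns of A can be estimated.
points = np.vstack([X1.real.ravel(), X2.real.ravel()])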


Under-determined BSS

The sparsity enhancement is evident by comparing the pdf of a time-domain signal with that of an STFT-domain signal.

After this preprocessing step, a clustering algorithm, like K-means, can be used in order to detect the angles αi for the estimation of the mixing matrix (Blind Mixing Model Recovery (BMMR) step); a sketch of this step is given below. Another problem is recovering the original sources, solving the corresponding under-determined (non-square) system (Blind Source Recovery (BSR) step). In the literature there exist several ways to solve such a system.
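A minimal sketch of the BMMR step for 2 mixtures, clustering the scatter angles with scikit-learn's KMeans; the energy threshold and the library choice are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

def estimate_mixing_from_angles(x, n_sources, amp_thresh=0.05):
    """BMMR step for 2 mixtures: cluster scatter-plot angles to estimate A.

    x: (2, T) array of (sparse, e.g. time-frequency) mixture samples.
    Returns the estimated 2 x n_sources mixing matrix with unit-norm columns.
    """
    # keep only samples with significant energy; weak points carry no direction
    mag = np.sqrt(x[0]**2 + x[1]**2)
    keep = mag > amp_thresh * mag.max()
    # direction of each point, folded into [0, pi) since +v and -v define the same line
    theta = np.mod(np.arctan2(x[1, keep], x[0, keep]), np.pi)
    centers = KMeans(n_clusters=n_sources, n_init=10).fit(theta.reshape(-1, 1))
    alphas = np.sort(centers.cluster_centers_.ravel())
    return np.vstack([np.cos(alphas), np.sin(alphas)])   # A = [cos a_i; sin a_i]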


Blind Source Extraction (BSE)

BSS can be computationally very demanding if the number of source signals is large. Fortunately, Blind Source Extraction (BSE) overcomes this difficulty by attempting to recover only a small subset of desired independent components from a large number of sensor signals: most of the existing BSE criteria and associated algorithms recover only one independent component at a time.

A single processing unit (a neuron) is used in the first step to extract one independent source signal with the specified stochastic properties. In the next step a deflation technique is used in order to eliminate the already extracted signals from the mixtures.


Blind Source Extraction (BSE)

A learning rule for the extraction of the first independent component can be derived, after whitening, by minimizing the following cost function:

J1(w1) = −(1/4) |k(y1)| = −(β/4) k(y1)

where k(y1) is the kurtosis of the first output y1 and β = ±1 determines the sign of the kurtosis of the extracted signal. Applying the standard gradient descent, we have:

∆w1 = −µ1 ∂J1(w1)/∂w1 = µ1 β (m4(y1)/m2³(y1)) [ (m2(y1)/m4(y1)) E{y1³ x1} − E{y1 x1} ]

where µ1 > 0 is the learning rate, m2(y1) = E{|y1|²} and m4(y1) = E{|y1|⁴}. The term m4(y1)/m2³(y1) is always positive and can be absorbed into the learning rate. Let us pose

ϕ(y1) = β (m4(y1)/m2³(y1)) [ (m2(y1)/m4(y1)) y1³ − y1 ]

so we obtain the following learning rule:

∆w1 = µ1 ϕ(y1) x1    (18)


Blind Source Extraction (BSE)

After successful extraction of the first source signal y1 ≈ s1, we can apply the deflation procedure, which removes previously extracted signals from the mixtures. This procedure may be recursively applied to extract all source signals:

xj+1 = xj − wj yj,   j = 1, 2, . . .

where the filter wj can be optimally estimated by minimizing the cost function

Jj(wj) = E{xTj+1 xj+1} = E{xTj xj} − 2 wTj E{xj yj} + wTj wj E{yj²}

with respect to wj, which leads to the simple updating equation:

wj = E{xj yj} / E{yj²} = E{xj xTj} wj / E{yj²}


Other BSS techniques


Nonlinear BSS

Unfortunately, linear instantaneous mixing models are too unrealistic and unsatisfactory in many applications. A more realistic mixing system inevitably introduces a nonlinear distortion in the signals. In this way the possibility of taking into account these distortions can give better results in signal separation. The problem is that in the nonlinear case the uniqueness of the solution is not guaranteed. The solution becomes easier in a particular case, called Post Nonlinear (PNL) mixture, well known in the literature in the case of the real domain. In this context the solution is unique too.

The mixing procedure is realized in two steps: first there is a linear mixing, then each mixture is obtained as a nonlinearly distorted version of the linear mixtures. The PNL mixture model is quite probable in many practical situations.


Nonlinear BSS

If the mixing-separating system is nonlinear and no other assumption is given for the mixing operator, a generic de-mixing model does not assure the existence and uniqueness of the solution. In order to better illustrate this aspect, see the following example.

Example

Consider two independent random variables: s1 with uniform distribution in [0, 2π) and s2 with Rayleigh distribution, i.e. with pdf ps2(s2) = (s2/σ2²) e^{−s2²/2} and variance σ2² = 1.

Given the two nonlinear transformations y1 = s2 cos s1 and y2 = s2 sin s1, the random variables y1 and y2 are still independent but are Gaussian distributed, so they cannot be separated, as a consequence of the Darmois-Skitovich theorem. In fact the Jacobian J of this transformation is:

det(J) = det( [ ∂y1/∂s1  ∂y1/∂s2 ; ∂y2/∂s1  ∂y2/∂s2 ] ) = det( [ −s2 sin s1  cos s1 ; s2 cos s1  sin s1 ] ) = −s2

so the joint pdf of y = [y1, y2] can be expressed as

py1,y2(y1, y2) = ps1,s2(s1, s2) / |det(J)| = (1/2π) exp( −(y1² + y2²)/2 ) =
              = ( (1/√2π) exp(−y1²/2) ) · ( (1/√2π) exp(−y2²/2) ) ≡ py1(y1) · py2(y2)


Nonlinear BSS: the mirror model

In order to solve the BSS problem in a nonlinear environment, using PNL mixtures, the following de-mixing model is adopted:

This model is known as the mirror model, because the de-mixing model is the mirror image of the mixing one. In formulas:

v = A s
x = F(v),   xi = fi(vi)
r = G(x),   ri = gi(xi)
u = W r


Nonlinear BSS system

The whole system is shown in the following learning scheme:

The learning architecture must adapt the matrix weights wij, the nonlinear activation functions hi(ui) and the functions gi(xi) compensating the distorting functions fi(vi).


Nonlinear BSS: a flexible solution

In order to obtain an adaptive representation of the activation functions and of the functions compensating the distortions, we can adopt splines. In this way we have three sets of free parameters:

1 the entries wij of the de-mixing matrix;

2 the control points Qhi of the AFs;

3 the control points QGi of the compensating functions gi(xi).

Using the InfoMax algorithm, we have to maximize the following cost function:

∂L{y, Φ}/∂Φ = ∂/∂Φ [ ln |det(W)| + Σ_{k=1}^{N} ln ẏk + Σ_{k=1}^{N} ln ṙk ]

where the free parameters are Φ = {wij, Qhi, QGi}.


Nonlinear BSS: a flexible solution

Maximization of the previous equation by the stochastic gradient method yields three learning rules:

1 The learning rule for the network weights is:

∆W = W−T + Ψ rT

where Ψi = h″i(ui)/h′i(ui) as usual.

2 The learning rule for the spline activation functions is:

∆Qhi+m = (Ṫ Mm)/(Ṫ M Qhi+m)

where Mm is the m-th column of the M matrix.

3 The learning rule for the spline compensating functions is:

∆QGi+m = (Ṫ Mm)/(Ṫ M QGi+m) + Ψ WHk Ṫ Mm

where Wk is a vector composed of the k-th column of the matrix W.


Indeterminacy of nonlinear BSS

In PNL BSS we do not have only the usual scaling and permutation ambiguities.

Definition

Let A ∈ Gl(n) be an invertible matrix; then A = (aij), i, j = 1 . . . n, is said to be absolutely degenerate if there are two columns l ≠ m such that a²il = λ a²im for a λ ≠ 0, i.e. the normalized columns differ only by the signs of the entries.

A not so obvious indeterminacy occurs if A is absolutely degenerate: only the matrix A, but not the nonlinearities, can be recovered.

An additional indeterminacy comes into play because of translation: the distorting functions fi can only be recovered up to a constant offset.


Other BSS environment

The nature of some problems suggests a natural solution in a complex environment, due to the need for frequency-domain signal processing, which is quite common in telecommunication and biomedical applications. The problem formulation and solution are similar to the real-valued case, except that all quantities involved in the model are complex-valued. For example, let us consider a vector s = [s1, . . . , sN]T of N complex sources (s ∈ CN). The k-th source can be expressed as sk = sRk + j sIk, where sRk and sIk are the real and imaginary parts of the k-th complex-valued source signal and j = √−1 is the imaginary unit. The goal of complex BSS is to recover the complex signal s from observations of the complex mixture x = [x1, . . . , xN]T, where the k-th mixture can be expressed as xk = xRk + j xIk, with xRk and xIk its real and imaginary parts. In this way the models of the real-valued case are still valid, but the mixing matrix A and the de-mixing matrix W are complex matrices (aij ∈ C and wij ∈ C):

x = A s
u = W x

In this way all the real-valued algorithms, like InfoMax, can be generalized to the complex-valued environment.


Example 1: cocktail-party problem

A first example is the well-known cocktail-party problem in a reverberant environment:

x(t) = A(t) ∗ s(t)  ⟺ (STFT)  X(f, t) = A(f) S(f, t)

A convolutive BSS problem can be addressed as a set of N instantaneous problems, one for each frequency bin.


Example 1: cocktail-party problem

An example of spectrogram of two sources and two mixtures.


Example 2: fMRI problem

A second example is Functional Magnetic Resonance Imaging, or fMRI.

Two orthogonal scanners acquire a slice of temporal voxels in order to measure the Blood-Oxygen-Level Dependent (BOLD) signal.


Example 2: fMRI problem

Example of fMRI images: we have 5 super-Gaussian, 1 Gaussian and 2 sub-Gaussian signals.


Example 3: communication problem

A third example is band-pass transmission.

Usually the complex envelope is used:

x (t) = xR (t) + jxS (t)


Example 3: communication problem

Example of an 8-PSK, a 16-QAM, a 4-QAM and a uniform noise signal.


The complex activation function (AF)

One of the main issues in designing complex neural networks is the presence of complex nonlinear activation functions h(z) (where z = zR + j zI) involved in the learning process.

The main challenge is the dichotomy between boundedness and analyticity in the complex domain, as stated by Liouville's theorem: complex functions bounded on the whole complex plane are either constant or not analytic. Thus this kind of complex nonlinear function is not suitable as the activation function of a neural network.

In particular, Georgiou & Koutsougeras defined the following set of properties:

1 h(z) is nonlinear in zR and zI;
2 h(z) is bounded: |h(z)| ≤ c < ∞;
3 the partial derivatives hRzR, hRzI, hIzR and hIzI exist and are bounded;
4 h(z) is not entire;
5 hRzR hIzI ≠ hRzI hIzR;

where hR(zR, zI) and hI(zR, zI) are known as the real part function and the imaginary part function of the complex function h(z), respectively.


The complex activation function (AF)

In this context Kim and Adali proposed the use of the so-called elementary transcendental functions (ETFs). They classified the ETFs into categories of unbounded functions, depending on which kind of singularities they possess:

1 Circular functions: tan (z), sin (z) and cot (z);

2 Inverse circular functions: tan−1 (z), sin−1 (z) and cos−1 (z);

3 Hyperbolic functions: tanh (z), sinh (z) and coth (z);

4 Inverse hyperbolic functions: tanh−1 (z), sinh−1 (z) and cosh−1 (z).

As expected, the trigonometric and the corresponding hyperbolic functions behave very similarly.


The complex activation function (AF): the splitting solution

According to the properties listed above, in order to overcome the dichotomy between boundedness and analyticity, complex nonlinear splitting functions have been introduced. In this approach the real and imaginary parts are processed separately by real-valued nonlinear functions. The splitting function

h (z) = h (zR , zI ) = hR (zR) + jhI (zI )

avoids the problem of unboundedness of complex nonlinearities, as stated above,but it cannot be analytic.


The complex activation function (AF): the generalized splitting solution

The splitting model of a nonlinear complex-valued function is not realistic, because usually the real and imaginary parts are correlated. For this reason it is useful to build a more realistic model of the nonlinear functions. In this way, Vitagliano et al. (2003) proposed a complex neural network based on a couple of bi-dimensional functions, called the generalized splitting function:

h (z) = h (zR , zI ) = hR (zR , zI ) + jhI (zR , zI )

In this way h(z) is bounded but it is not analytic. The Cauchy-Riemann conditions (hRzR = hIzI, hIzR = −hRzI) are not satisfied by the complex function itself, but they can be imposed as an algorithmic constraint during the learning process.


Spline implementation of splitting functions

The new idea is to use spline-based functions for implementing the splitting activation function. Thus, remembering the spline matrix formulation, we can first evaluate the span indexes iR and iI and the local parameters νR and νI, then we have

yk = hRk(uRk) + j hIk(uIk) = TR · M · QRiR + j TI · M · QIiI


The generalized spline neuron

A matrix formulation of the output yi can be written as:

yi = hi(uR, uI) = Tν,2 · M · (Tν,1 · M · Qi)T

where

Tν,k = [u³k u²k uk 1],   k = 1, 2

Qi is the 4 × 4 matrix of control points [Q(i1+r, i2+c)j], r, c = 0, . . . , 3, surrounding the current span, and

M = (1/2) [ −1  3 −3  1
             2 −5  4 −1
            −1  0  1  0
             0  2  0  0 ]

The matrix M determines the characteristics of the interpolating curve: CR-Spline or B-Spline.

In order to satisfy a monotonic characteristic of the overall function, the following constraint must be imposed: Qk1 < Qk2 < . . . < QkN.


Spline implementation of generalized splitting functions

Unfortunately, the real and imaginary parts of a complex signal are usually correlated, not split into separate channels. In this way we need a better model of the complex AF. We can render each of the two bi-dimensional real functions hR(uR, uI) and hI(uR, uI) with bi-dimensional splines: one plays the role of the real part and one of the imaginary part of the complex activation function:

yRk = hRk(uRk, uIk) = TνI(uIk) · M · (TνR(uRk) · M · Q(iR,iI)[2]R)T,
yIk = hIk(uRk, uIk) = TνI(uIk) · M · (TνR(uRk) · M · Q(iR,iI)[2]I)T,
yk = yRk + j yIk


The de-mixing algorithm

In order to derive the de-mixing algorithm, let us pose

x[n] = [xR[n]; xI[n]] = [xR1(n) . . . xRN(n); xI1(n) . . . xIN(n)]
u[n] = [uR[n]; uI[n]] = [uR1(n) . . . uRN(n); uI1(n) . . . uIN(n)]
y[n] = [yR[n]; yI[n]] = [h(uR[n]); h(uI[n])]

We utilize the InfoMax algorithm.

As cost function we use the joint entropy (ME approach).


The de-mixing algorithm

Let us denote by Φ the free parameters; then the cost function is:

L{y[n], Φ} = H(y[n]) = H(x) − E{ log( det(J) ) } =
           = H(x) + log |det W| + 2 · Σ_{n=1}^{N} E{ log |h′n(un)| }

The free parameters for our architecture are:

Φ = {wij, Qh}


The de-mixing algorithm

Evaluating the derivatives with respect to the de-mixing matrix, we obtain:

∂ log |det W| / ∂W = W−H

2 ∂ Σ_n log |h′n(un)| / ∂W = Ψ(u) xH

Ψ(u) is a function depending on the spline parameters.

We obtain the following learning rule, which is a generalization of the Bell & Sejnowski algorithm:

∆W = ∂L{y, Φ}/∂W = η ( W−H + Ψ(u) xH )


The de-mixing algorithm

The function Ψ(u) = ΨR(uR, uI) + jΨI(uR, uI) is formed as:

ψiR = 2 [ (∂yiR/∂uiR)(∂²yiR/∂u²iR) + (∂yiR/∂uiI)(∂/∂uiR)(∂yiR/∂uiI) ] / [ (∂yiR/∂uiR)² + (∂yiR/∂uiI)² ]

ψiI = 2 [ (∂yiR/∂uiI)(∂²yiR/∂u²iI) + (∂yiR/∂uiR)(∂/∂uiI)(∂yiR/∂uiR) ] / [ (∂yiR/∂uiR)² + (∂yiR/∂uiI)² ]

In matrix notation, ψiR and ψiI are obtained by substituting into these ratios the spline expressions TIi · M · (TRi · M · QRi)T and their derivatives with respect to uR and uI (i.e. the same products with TIi or TRi differentiated as appropriate).


The de-mixing algorithm

Evaluating the derivatives with respect to the spline control points, we obtain:

2 ∂ Σ_n log |h′n(un)| / ∂Qj,iR+mR,iI+mI = 0   for i ≠ j,

while for i = j it is

2 [ (∂yjR/∂ujR) ∂/∂Qj,iR+mR,iI+mI (∂yjR/∂ujR) + (∂yjR/∂ujI) ∂/∂Qj,iR+mR,iI+mI (∂yjR/∂ujI) ] / [ (∂yjR/∂ujR)² + (∂yjR/∂ujI)² ]

In matrix notation, the update ∆Qj,iR+mR,iI+mI (with learning rate ηQ) is obtained by substituting into this ratio the spline expressions TIj · M · (TRj · M · QRj)T, their derivatives, and the reduced terms TIj · MmI · (TRj · MmR)T, where Mk is a zero matrix except for the k-th column, which is equal to the k-th column of the M matrix.


The de-mixing algorithm for the PNL case

A similar architecture can be devised for the PNL mixtures case. In this new architecture a third learning rule can be derived for the adaptation of the spline control points of the compensating nonlinear functions:

∆QGR,i+m = (TR Mm)/(TR M QGR,i+m) + Re{Ψ WHk} TR Mm,
∆QGI,i+m = (TI Mm)/(TI M QGI,i+m) + Im{Ψ WHk} TI Mm


Phase recovery

The scaling ambiguity of BSS algorithms is reflected in a phase ambiguity (rotation) in the complex domain. We can analyze how to recover this information.

Assume that x = [x1, . . . , xN]T with xi = [xRi, xIi]T, and s = [s1, . . . , sN]T with si = [sRi, sIi]T. Writing each complex product in terms of real and imaginary parts, the mixing can be expressed with 2 × 2 real blocks:

[x1; x2; . . . ; xN] = [A11 A12 · · · A1N ; A21 A22 · · · A2N ; . . . ; AN1 AN2 · · · ANN] · [s1; s2; . . . ; sN]    (19)

where Aij = [aijR −aijI ; aijI aijR] and det(Aij) = a²ijR + a²ijI = |aij|².


Phase recovery

We desire a similar de-mixing model:

[u1; u2; . . . ; uN] = [W11 W12 · · · W1N ; W21 W22 · · · W2N ; . . . ; WN1 WN2 · · · WNN] · [x1; x2; . . . ; xN]    (20)

where Wij = [wijR −wijI ; wijI wijR] and det(Wij) = w²ijR + w²ijI = |wij|².


Phase recovery

So we can obtain phase recovery with the following constraint:

Wij = (1/2) { Wij + det(Wij) · Wij−T }    (21)

It can be demonstrated that the previous constraint is equivalent to using a splitting activation function:

h (z) = hR (zR) + jhI (zI )
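A small sketch of how the constraint (21) can be applied block-wise to the real-valued representation (20) of W; the function names are illustrative:

import numpy as np

def complex_to_blocks(W):
    """Real 2N x 2N representation of a complex N x N matrix, as in Eq. (20)."""
    N = W.shape[0]
    B = np.zeros((2 * N, 2 * N))
    for i in range(N):
        for j in range(N):
            a, b = W[i, j].real, W[i, j].imag
            B[2*i:2*i+2, 2*j:2*j+2] = [[a, -b], [b, a]]
    return B

def apply_phase_constraint(B):
    """Project each 2x2 block with W_ij <- (W_ij + det(W_ij) W_ij^{-T}) / 2, Eq. (21)."""
    N = B.shape[0] // 2
    out = B.copy()
    for i in range(N):
        for j in range(N):
            blk = B[2*i:2*i+2, 2*j:2*j+2]
            d = np.linalg.det(blk)
            if abs(d) > 1e-12:                          # skip (near-)singular blocks
                out[2*i:2*i+2, 2*j:2*j+2] = 0.5 * (blk + d * np.linalg.inv(blk).T)
    return out

Note that blocks which already have the structure [a −b; b a] are fixed points of this projection, which is consistent with the equivalence to the splitting AF stated above.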


References

A. Hyvärinen, J. Karhunen, E. Oja. Independent Component Analysis. Wiley, 2001.

A. Cichocki and S.-I. Amari. Adaptive Blind Signal and Image Processing. Wiley, 2003.

A.J. Bell and T.J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, Vol. 7, pp. 1129-1159, 1995.

H.H. Yang and S.-I. Amari. Adaptive On-Line Learning Algorithms for Blind Separation: Maximum Entropy and Minimum Mutual Information. Neural Computation, Vol. 9, pp. 1457-1482, 1997.

S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, Vol. 10, pp. 251-276, 1998.

M. Scarpiniti, D. Vigliano, R. Parisi and A. Uncini. Generalized Splitting Functions for Blind Separation of Complex Signals. Neurocomputing, Vol. 71, N. 10-12, pp. 2245-2270, 2008.

D. Vigliano, M. Scarpiniti, R. Parisi and A. Uncini. Flexible Nonlinear Blind Signal Separation in the Complex Domain. International Journal of Neural Systems, Vol. 18, N. 2, pp. 105-122, April 2008.
