
    Michel Verleysen Altran 18/11/2002 - 1

Artificial Neural Networks:
general overview and specific signal processing model

November 2002

    Michel Verleysen Altran 18/11/2002 - 2

    Content

• Part I: Introduction
  - why “artificial neural networks” (ANN)?
  - what are ANNs useful for?
  - learning – generalization – overfitting
• Part II: From linear to non-linear learning models
  - Adaline (linear model)
  - Multi-Layer Perceptron (MLP) (non-linear model)
  - stochastic gradient learning
  - learning, validation and test
• Part III: Blind Source Separation (BSS)
  - BSS model
  - statistical independence
  - some realizations


    Michel Verleysen Altran 18/11/2002 - 4

Why ANNs?

Von Neumann's computer         (Human) brain
determinism                    fuzzy behaviour
sequence of instructions       parallelism
high speed                     slow speed
repetitive tasks               adaptation to situations
programming                    learning
uniqueness of solutions        different solutions
ex: matrix product             ex: face recognition

    Michel Verleysen Altran 18/11/2002 - 5

    Artificial Neural Networks

• Artificial neural networks ARE NOT:
  - an attempt to understand biological systems
  - an attempt to reproduce the behavior of a biological system
• Artificial neural networks ARE:
  - a set of tools aimed at solving “perceptive” problems (“perceptive” as opposed to “rule-based”)

At least in this framework, i.e. “Neural Computation”…

    Michel Verleysen Altran 18/11/2002 - 6

    Why “Artificial neural networks” ?

• Structure: many computation units in parallel (McCulloch & Pitts, 1943)
• Learning rule (adaptation of parameters): similar to Hebb's rule (1949)
• And that's all…

[diagram: inputs → two layers of units → outputs]

$y_k = h\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, g\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$,  i.e.  $y(x) = h\left(w^{(2)}\, g\left(w^{(1)} x\right)\right)$
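To make the two-layer formula above concrete, here is a minimal forward-pass sketch in Python/NumPy. The layer sizes, the tanh/identity choices for g and h, and the random weights are illustrative assumptions, not part of the original slides.

import numpy as np

def forward(x, W1, W2, g=np.tanh, h=lambda a: a):
    # y = h(W2 . g(W1 . x)), with a constant bias unit appended at each layer
    x = np.append(x, 1.0)                  # x_0 = 1 (bias input)
    z = np.append(g(W1 @ x), 1.0)          # hidden outputs z_j = g(sum_i w_ji^(1) x_i), plus z_0 = 1
    return h(W2 @ z)                       # outputs y_k = h(sum_j w_kj^(2) z_j)

D, M, C = 3, 5, 2                          # inputs, hidden units, outputs (arbitrary sizes)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(M, D + 1))           # first-layer weights w_ji^(1), last column = bias
W2 = rng.normal(size=(C, M + 1))           # second-layer weights w_kj^(2), last column = bias
y = forward(rng.normal(size=D), W1, W2)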

    Michel Verleysen Altran 18/11/2002 - 7

    Learning

• Give - many - examples (input-output pairs, or training samples)
• Compute the parameters to fit the examples
• Test the result!

[diagram: inputs → network → outputs]

    Michel Verleysen Altran 18/11/2002 - 8

    Why Neural Computation ?

    p Hypothetical problem of optical character recognition (OCR)

• If: - image of 256 × 256 pixels
      - 8-bit pixel values (256 gray levels)
  then: about $10^{158000}$ different images
• Necessity to work with features!

    Michel Verleysen Altran 18/11/2002 - 9

    Feature Extraction

• Example of a feature in OCR: the height/width ratio (x1) of the character
• Histogram of feature values

[histogram of x1 for the “a” class and the “b” class, showing overlap]

    Michel Verleysen Altran 18/11/2002 - 10

    Multi-dimensional Features

• Necessity to classify according to several, but not too many, features
• Several: because of the overlap in the histogram with 1 feature
• Not too many: because classification methods are easier in low-dimensional spaces
• ANN: self-construction of features!

[scatter plot: classes C1 and C2 in the (x1, x2) feature plane]

    Michel Verleysen Altran 18/11/2002 - 11

    What are ANNs useful for ?

• regression
• classification
• adaptive filtering
• projection

[plot: regression example]

    Michel Verleysen Altran 18/11/2002 - 12

    What are ANNs useful for ?

• regression
• classification
• adaptive filtering
• projection

[plot: classification example]

    Michel Verleysen Altran 18/11/2002 - 13

    Classification - Regression

• regression: features (x1, x2) → output value y (continuous variable)
• classification: features (x1, x2) → class label C1 or C2 (discrete variable)

    Michel Verleysen Altran 18/11/2002 - 14

    What are ANNs useful for ?

• regression
• classification
• adaptive filtering
• projection

[plots: adaptive filtering example]

    Michel Verleysen Altran 18/11/2002 - 15

    What are ANNs useful for ?

• regression
• classification
• adaptive filtering
• projection

    Michel Verleysen Altran 18/11/2002 - 16

    Principle

[block diagram: inputs → non-linear parametric regression model → outputs; the outputs are compared with the desired outputs (error criterion) and the parameters are adapted accordingly]

    Michel Verleysen Altran 18/11/2002 - 17

    Principle

• Model: restrictions about the possible relation

[same block diagram as above]

    Michel Verleysen Altran 18/11/2002 - 18

    Principle

• Parametric: learning parameters to adjust (they contain the information)

[same block diagram as above]

    Michel Verleysen Altran 18/11/2002 - 19

    Principle

• Non-linear: more powerful than linear

[same block diagram as above]

    Michel Verleysen Altran 18/11/2002 - 20

    Principle

Universal approximator!

[same block diagram as above]

    Michel Verleysen Altran 18/11/2002 - 21

    Error criterion

• regression
• classification
• adaptive filtering
• projection

[diagram: inputs → model → outputs, compared with desired outputs]
[plot: regression example]

Criterion for regression: sum of squared output errors, $\sum_i \delta_i^2$

    Michel Verleysen Altran 18/11/2002 - 22

    Error criterion

• regression
• classification
• adaptive filtering
• projection

[diagram: inputs → model → outputs, compared with desired outputs]
[plot: classification example]

Criterion for classification: number of wrong classifications

    Michel Verleysen Altran 18/11/2002 - 23

    Error criterion

• regression
• classification
• adaptive filtering
• projection

[diagram: inputs → model → outputs, compared with desired outputs]
[plots: four signal waveforms]

Criterion: independence between outputs

    Michel Verleysen Altran 18/11/2002 - 24

    Error criterion

• regression
• classification
• adaptive filtering
• projection

[diagram: inputs → model → outputs, compared with desired outputs]

Criterion for projection: independence between outputs

    Michel Verleysen Altran 18/11/2002 - 25

    Polynomial Curve Fitting

• Strong similarity between polynomial curve fitting and regression with neural networks
• polynomials:  $y = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
• neural networks:  $y = f(\mathbf{x}, \mathbf{w})$
• Error criterion:  $E = \sum_{p=1}^{P} \left(y^p - t^p\right)^2$  (t: desired output, or target)
• Polynomials: E is quadratic in w → min(E) is the solution of a set of linear equations
• Neural networks: E is non-linear in w → local minima
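Because E is quadratic in w for the polynomial model, the minimum is obtained by solving a linear system. A small NumPy sketch (the data, noise level and degree M are made-up illustrations):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 30)                                          # toy inputs
t = 1.0 + 0.5 * np.sin(1.5 * x) + 0.1 * rng.normal(size=x.size)    # noisy targets

M = 3                                                  # polynomial degree (assumption)
X = np.vander(x, M + 1, increasing=True)               # columns 1, x, x^2, ..., x^M
w, *_ = np.linalg.lstsq(X, t, rcond=None)              # linear least-squares solution
y = X @ w                                              # fitted values y^p = sum_j w_j (x^p)^j

For a neural network the same criterion E is non-linear in w, so no such direct solution exists and iterative (gradient) learning is used instead.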

    Michel Verleysen Altran 18/11/2002 - 26

    Learning - generalization

• Learning: process of
  - finding the parameters w which
  - minimize the error function E
  - given a set of input-output pairs (x^p, t^p)

[plots: training pairs (x, t) and the curve fitted by adjusting w]

    Michel Verleysen Altran 18/11/2002 - 27

Learning - generalization

• generalization: process of
  - assigning an output value y to
  - a new input x
  - given the parameter vector w

[plots: the fitted curve, parameterized by w, evaluated at a new input x]

    Michel Verleysen Altran 18/11/2002 - 28

    Overfitting

• Compromise between
  - model complexity and
  - number of learning pairs
• Neural networks are
  - universal approximation models with
  - a (relatively) low complexity
• But overfitting must be a serious concern!

[plots: the same data fitted by a smooth curve and by an overfitted curve]

    Michel Verleysen Altran 18/11/2002 - 29

    Overfitting in classification

[figures: decision boundaries between classes C1 and C2 in the (x1, x2) plane, from simple to overfitted]

    Michel Verleysen Altran 18/11/2002 - 30

    The Curse of Dimensionality

• Histograms: the number of boxes is exponential with the dimension
• More features → reduced performances
• If # data is limited → dimension must be low
• Concept of intrinsic dimensionality of data

    Michel Verleysen Altran 18/11/2002 - 31

    Content

• Part I: Introduction
  - why “artificial neural networks” (ANN)?
  - what are ANNs useful for?
  - learning – generalization – overfitting
• Part II: From linear to non-linear learning models
  - Adaline (linear model)
  - Multi-Layer Perceptron (MLP) (non-linear model)
  - stochastic gradient learning
  - learning, validation and test
• Part III: Blind Source Separation (BSS)
  - BSS model
  - statistical independence
  - some realizations

    Michel Verleysen Altran 18/11/2002 - 32

From linear to non-linear regression

• Linear regression
  + simpler
  + easy « learning » (direct or adaptive)
  − not sufficient for most problems
• Non-linear regression
  − more complex
  = adaptive learning only
  + may be adapted to most problems

    Michel Verleysen Altran 18/11/2002 - 33

    Adaline

• Adaline: ADAptive LINear Element
• = linear classification, linear discriminant functions, ...

$y = \mathbf{w}^T \mathbf{x}$,  with  $\mathbf{x} = (1, x_1, \ldots, x_D)^T$,  $\mathbf{w} = (w_0, w_1, \ldots, w_D)^T$  ($w_0$: bias)

[diagram: inputs $1, x_1, \ldots, x_D$, weights $w_0, w_1, \ldots, w_D$, single linear output y]

    Michel Verleysen Altran 18/11/2002 - 34

    Adaline: learning

• Adaline: one linear output
• patterns (learning vectors) must satisfy  $t^p = \mathbf{w}^T \mathbf{x}^p = \sum_{i=1}^{D} w_i x_i^p$
• but P > D
  - P patterns
  - D parameters (degrees of freedom)
• → non-ideal solution
• → optimisation (of parameters w) according to a criterion:

$E = \sum_{p=1}^{P} \left(y^p - t^p\right)^2 = \sum_{p=1}^{P} \left(\mathbf{w}^T \mathbf{x}^p - t^p\right)^2$

    Michel Verleysen Altran 18/11/2002 - 35

    Adaline: pseudo-inverse

• Error criterion  $E = \sum_{p=1}^{P} \left(y^p - t^p\right)^2 = \sum_{p=1}^{P} \left(\mathbf{w}^T \mathbf{x}^p - t^p\right)^2$
• Inputs $\mathbf{x}^p$ gathered in a matrix and targets $t^p$ in a vector:

$X = \left(\mathbf{x}^1\; \mathbf{x}^2\; \cdots\; \mathbf{x}^P\right) = \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^P \\ x_2^1 & x_2^2 & \cdots & x_2^P \\ \vdots & \vdots & & \vdots \\ x_D^1 & x_D^2 & \cdots & x_D^P \end{pmatrix}$,   $\mathbf{t}^T = \left(t^1\; t^2\; \cdots\; t^P\right)$

• Error criterion  $E = \left\| \mathbf{t}^T - \mathbf{w}^T X \right\|^2$

    Michel Verleysen Altran 18/11/2002 - 36

    Adaline: pseudo-inverse

• Error criterion  $E = \left\| \mathbf{t}^T - \mathbf{w}^T X \right\|^2$,  where $\mathbf{x}_j = \left(x_j^1\; x_j^2\; \cdots\; x_j^P\right)$ is the j-th row of X
• Gradient of the error (function to minimize) with respect to the weights (free parameters):

$\frac{\partial E}{\partial \mathbf{w}^T} \equiv \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_D}\right)$

$\frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j} \left\| \mathbf{t}^T - \mathbf{w}^T X \right\|^2 = \frac{\partial}{\partial w_j} \left(\mathbf{t}^T - \mathbf{w}^T X\right)\left(\mathbf{t} - X^T \mathbf{w}\right) = 2\left(\mathbf{w}^T X - \mathbf{t}^T\right)\mathbf{x}_j^T$

    Michel Verleysen Altran 18/11/2002 - 37

    Adaline: pseudo-inverse

• Criterion:  $E = \left\| \mathbf{t}^T - \mathbf{w}^T X \right\|^2$
• Derivative of criterion:  $\frac{\partial E}{\partial \mathbf{w}^T} = 2\left(\mathbf{w}^T X - \mathbf{t}^T\right) X^T$
• Minimum of error:  $\frac{\partial E}{\partial \mathbf{w}^T} = 0 \;\Rightarrow\; \mathbf{w} = \left(X X^T\right)^{-1} X\, \mathbf{t}$  (pseudo-inverse solution)
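A quick NumPy check of the pseudo-inverse solution, with X stored as a D×P matrix of patterns as on the slide (the data here are random placeholders):

import numpy as np

rng = np.random.default_rng(0)
D, P = 4, 100
X = rng.normal(size=(D, P))                 # columns are the patterns x^p
t = rng.normal(size=P)                      # targets t^p

w = np.linalg.solve(X @ X.T, X @ t)         # w = (X X^T)^(-1) X t
w_check = np.linalg.pinv(X.T) @ t           # same solution via the pseudo-inverse of X^T
assert np.allclose(w, w_check)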

    Michel Verleysen Altran 18/11/2002 - 38

    Adaline: iterative learning

• Pseudo-inverse requires:
  - all input-output pairs (x^p, t^p)
  - matrix inversion (often ill-conditioned)
• necessity for iterative methods without matrix inversion
• → Gradient descent!

    Michel Verleysen Altran 18/11/2002 - 39

    Adaline: gradient descent

• function to minimise: E
• parameters: w
• Gradient descent on $E = \sum_{p=1}^{P} \left(t^p - \mathbf{w}^T \mathbf{x}^p\right)^2 = \sum_{p=1}^{P} E^p$:

$\mathbf{w}(t+1) = \mathbf{w}(t) - \alpha \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)} = \mathbf{w}(t) + 2\alpha\, X\left(\mathbf{t} - X^T \mathbf{w}(t)\right)$

• Stochastic gradient: use one input-output pair at a time!

$\mathbf{w}(t+1) = \mathbf{w}(t) - \alpha \left.\frac{\partial E^p}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)} = \mathbf{w}(t) + 2\alpha \left(t^k - \mathbf{w}(t)^T \mathbf{x}^k\right)\mathbf{x}^k$

• pseudo-inverse, gradient descent and stochastic gradient descent: same error criterion → same solution!
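A sketch of the stochastic gradient rule for the Adaline; the learning rate, number of passes and random shuffling are assumptions for illustration:

import numpy as np

def adaline_sgd(X, t, alpha=0.01, epochs=50, seed=0):
    # X: D x P matrix of patterns, t: P targets; returns the weight vector w
    D, P = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(D)
    for _ in range(epochs):
        for k in rng.permutation(P):              # one input-output pair at a time
            err = t[k] - w @ X[:, k]
            w += 2 * alpha * err * X[:, k]        # w <- w + 2 alpha (t^k - w^T x^k) x^k
    return w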

    Michel Verleysen Altran 18/11/2002 - 40

    Multi-layer perceptron (MLP)

• Convention: 2 layers of weights (in the literature: sometimes 3 layers of units, or neurons)
• g and h can be threshold (sign) units or continuous ones
• h can be linear, but not g (otherwise only one layer)

[diagram: inputs $x_0$ (bias), $x_1, \ldots, x_D$; 1st layer of weights $w_{ji}^{(1)}$; hidden units $z_0$ (bias), $z_1, \ldots, z_M$; 2nd layer of weights $w_{kj}^{(2)}$; outputs $y_1, \ldots, y_C$]

$y_k = h\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, g\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$,  i.e.  $y(x) = h\left(w^{(2)}\, g\left(w^{(1)} x\right)\right)$

    Michel Verleysen Altran 18/11/2002 - 41

    Error back-propagation 1/

• Online (stochastic) algorithm considered here
• Some notations:

[diagram: weight $w_{ij}^{(l)}$ connects the output $z_j^{(l-1)}$ of layer l−1 to unit i of layer l, which has activation $a_i^{(l)}$ and output $z_i^{(l)} = g\left(a_i^{(l)}\right)$]

$a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}$

    Michel Verleysen Altran 18/11/2002 - 42

    Error back-propagation 2/

• Gradient descent: value of $\frac{\partial E}{\partial w_{ij}^{(l)}}$ to evaluate, with $a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}$

$\frac{\partial E}{\partial w_{ij}^{(l)}} = \frac{\partial E}{\partial a_i^{(l)}} \frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}} = \delta_i^{(l)}\, z_j^{(l-1)}$,   where $\delta_i^{(l)} \equiv \frac{\partial E}{\partial a_i^{(l)}}$ remains to be evaluated

    Michel Verleysen Altran 18/11/2002 - 43

    Error back-propagation 3/

• For output units ($\delta_i^{(l)} \equiv \frac{\partial E}{\partial a_i^{(l)}}$ to evaluate):

$\delta_i^{(l)} = \frac{\partial E}{\partial a_i^{(l)}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial a_i^{(l)}} = \frac{\partial E}{\partial y_i}\, g'\left(a_i^{(l)}\right)$

  - $\frac{\partial E}{\partial y_i}$: derivative of the error criterion (known)
  - $g'\left(a_i^{(l)}\right)$: known

    Michel Verleysen Altran 18/11/2002 - 44

    Error back-propagation 4/

• For hidden units ($\delta_i^{(l)} \equiv \frac{\partial E}{\partial a_i^{(l)}}$ to evaluate):

$\delta_i^{(l)} = \frac{\partial E}{\partial a_i^{(l)}} = \sum_k \frac{\partial E}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = \sum_k \delta_k^{(l+1)} \frac{\partial}{\partial a_i^{(l)}} \left(\sum_j w_{kj}^{(l+1)}\, g\left(a_j^{(l)}\right)\right) = g'\left(a_i^{(l)}\right) \sum_k w_{ki}^{(l+1)}\, \delta_k^{(l+1)}$

• The error term (δ) is expressed as a combination of the error terms of the next layer

    Michel Verleysen Altran 18/11/2002 - 45

    Error back-propagation 5/

• Apply an input vector $\mathbf{x}^k$ and propagate it through the network to evaluate all activations $a_i^{(l)}$ and neuron outputs $z_i^{(l)}$
• Evaluate the error terms $\delta_i^{(o)}$ in the output layer
• Back-propagate the error terms $\delta_i^{(l)}$ to find the error terms $\delta_i^{(l-1)}$
• Evaluate all derivatives  $\frac{\partial E}{\partial w_{ij}^{(l)}} = \delta_i^{(l)}\, z_j^{(l-1)}$
• Adjust the weights according to the derivatives and a gradient descent scheme (see the sketch below)
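The five steps above, written out for one hidden layer with tanh hidden units, linear outputs and a squared-error criterion; this is a hedged sketch (layer sizes, activation choices and learning rate are assumptions), not the exact algorithm variant of the slides:

import numpy as np

def backprop_step(x, t, W1, W2, alpha=0.01):
    # One stochastic gradient step for a 1-hidden-layer MLP; W1, W2 are updated in place.
    # 1. forward pass: activations and unit outputs
    x1 = np.append(x, 1.0)                     # input plus bias
    a1 = W1 @ x1                               # hidden activations
    z1 = np.append(np.tanh(a1), 1.0)           # hidden outputs plus bias
    y = W2 @ z1                                # linear output layer
    # 2. error terms in the output layer (E = sum (y - t)^2, so dE/dy = 2 (y - t), h' = 1)
    delta2 = 2.0 * (y - t)
    # 3. back-propagate: delta1 = g'(a1) * sum_k w_ki^(2) delta2_k
    delta1 = (1.0 - np.tanh(a1) ** 2) * (W2[:, :-1].T @ delta2)
    # 4. all derivatives dE/dw_ij = delta_i * z_j of the previous layer
    gW1 = np.outer(delta1, x1)
    gW2 = np.outer(delta2, z1)
    # 5. gradient descent update
    W1 -= alpha * gW1
    W2 -= alpha * gW2
    return np.sum((y - t) ** 2)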

    Michel Verleysen Altran 18/11/2002 - 46

    Back-propagation : weight adjustment 1/

• Now that we have the derivatives $\frac{\partial E}{\partial w_{ij}^{(l)}}$, how do we adjust the weights according to these derivatives?
• Notation:  $\delta \mathbf{w} \equiv \mathbf{w}(t+1) - \mathbf{w}(t)$

    Michel Verleysen Altran 18/11/2002 - 47

    Back-propagation : weight adjustment 2/

• First-order methods:

$E\left(\mathbf{w}(t+1)\right) = E\left(\mathbf{w}(t) + \delta\mathbf{w}\right) \approx E\left(\mathbf{w}(t)\right) + \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)}^{T} \left(\mathbf{w}(t+1) - \mathbf{w}(t)\right) = E\left(\mathbf{w}(t)\right) + \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)}^{T} \delta\mathbf{w}$

• For a δw of a given magnitude, the largest δE is found when the gradient and the weight update vectors are parallel
• Adaptation rule:  $\mathbf{w}(t+1) = \mathbf{w}(t) - \alpha \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)}$

    Michel Verleysen Altran 18/11/2002 - 48

    Back-propagation : weight adjustment 4/

• Improvements: momentum
• to avoid sharp changes in gradient direction (stochastic scheme, outliers, etc.)

$\mathbf{w}(t+1) = \mathbf{w}(t) - \alpha \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)} + \beta\left(\mathbf{w}(t) - \mathbf{w}(t-1)\right)$,  with $\beta \approx 0.9$
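The momentum rule as a one-line modification of the plain gradient step; the toy quadratic error used to exercise it is an assumption:

import numpy as np

def momentum_step(w, w_prev, grad, alpha=0.01, beta=0.9):
    # w(t+1) = w(t) - alpha * dE/dw + beta * (w(t) - w(t-1))
    return w - alpha * grad + beta * (w - w_prev)

w_prev = np.array([1.0, 1.0])
w = np.array([0.9, 0.9])
for _ in range(200):                       # E(w) = ||w||^2, gradient 2w
    w, w_prev = momentum_step(w, w_prev, 2 * w), w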

    Michel Verleysen Altran 18/11/2002 - 49

    Back-propagation : weight adjustment 5/

• Improvements: adaptive learning rate α
  - in classification problems: if underrepresented classes
  - maximum safe step size
• “risk-taking” algorithms: individual learning rates (this is more or less the “delta-bar-delta” rule):

$w_{ij}(t+1) = w_{ij}(t) - \alpha_{ij}(t) \left.\frac{\partial E}{\partial w_{ij}}\right|_{\mathbf{w}(t)}$

$\delta w_{ij}(t+1)\, \delta w_{ij}(t) > 0 \;\Rightarrow\; \alpha_{ij}(t+1) = \alpha_{ij}(t) + \kappa$
$\delta w_{ij}(t+1)\, \delta w_{ij}(t) < 0 \;\Rightarrow\; \alpha_{ij}(t+1) = \alpha_{ij}(t) - \kappa$
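A sketch of per-weight adaptive learning rates along the lines of the rule above (additive increase/decrease by κ when successive updates agree/disagree in sign); the lower bound on α and the toy error are assumptions:

import numpy as np

def adaptive_lr_step(w, alphas, grad, dw_prev, kappa=0.001):
    dw = -alphas * grad                                        # per-weight update
    agree = dw * dw_prev > 0                                   # same sign as previous update?
    alphas = np.where(agree, alphas + kappa, np.maximum(alphas - kappa, 1e-6))
    return w + dw, alphas, dw

w, alphas, dw_prev = np.array([1.0, -2.0]), np.full(2, 0.01), np.zeros(2)
for _ in range(200):                                           # E(w) = ||w||^2, gradient 2w
    w, alphas, dw_prev = adaptive_lr_step(w, alphas, 2 * w, dw_prev)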

    Michel Verleysen Altran 18/11/2002 - 51

    Validation and test

• Basic principle: never test on learning data

[diagram: Data split into a Learning set and a Test set]

    Michel Verleysen Altran 18/11/2002 - 52

    Validation and test

• Basic principle: never test on learning data
• Corollary: never compare models on learning data

[diagram: Data split into a Learning set (N1), a Validation set (N2) and a Test set (N3)]

    Michel Verleysen Altran 18/11/2002 - 53

    Validation and test

• Basic principle: never test on learning data
• Corollary: never compare models on learning data
• Learning set: used to learn each model
• Validation set: used to compare models and select one of them
• Test set: used to estimate the error made by the selected model

[diagram: Data split into a Learning set (N1), a Validation set (N2) and a Test set (N3)]

    Michel Verleysen Altran 18/11/2002 - 54

    Validation and test

• Each model depends on the (sub)set of data used
  → several learning runs for each model
  → validation errors are averaged
• Average validation errors are compared (between models)
• The best model is selected
• Its error is estimated on the test set
• To select the subsets of data: several ways
  - (k-fold) cross-validation
  - bootstrap
• Particularly important with small sets of high-dimensional data!
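A minimal NumPy-only sketch of the learning / validation / test protocol; the split sizes and the two candidate models (polynomial degrees 1 and 3) are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 120)
t = 1 + 0.5 * np.sin(1.5 * x) + 0.1 * rng.normal(size=x.size)

idx = rng.permutation(x.size)
learn, val, test = idx[:60], idx[60:90], idx[90:]          # N1 / N2 / N3 subsets

def err(w, i):                                             # mean squared error on subset i
    return np.mean((np.polyval(w, x[i]) - t[i]) ** 2)

models = {deg: np.polyfit(x[learn], t[learn], deg) for deg in (1, 3)}   # learning set: fit each model
best = min(models, key=lambda d: err(models[d], val))                   # validation set: compare and select
test_error = err(models[best], test)                                    # test set: estimate the final error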

    Michel Verleysen Altran 18/11/2002 - 55

    Content

• Part I: Introduction
  - why “artificial neural networks” (ANN)?
  - what are ANNs useful for?
  - learning – generalization – overfitting
• Part II: From linear to non-linear learning models
  - Adaline (linear model)
  - Multi-Layer Perceptron (MLP) (non-linear model)
  - stochastic gradient learning
  - learning, validation and test
• Part III: Blind Source Separation (BSS)
  - BSS model
  - statistical independence
  - some realizations

    Michel Verleysen Altran 18/11/2002 - 56

    Cocktail party


    Michel Verleysen Altran 18/11/2002 - 57

    Blind Source Separation (BSS) orIndependent Component Analysis (ICA)

• The sources are unknown (but independent)
• The mixture is unknown
• The observations are known

    Michel Verleysen Altran 18/11/2002 - 58

    ICA principle

[plots: two unknown source signals s1(t), s2(t), the two observed mixtures x1(t), x2(t), and the two estimated sources y1(t), y2(t)]

[diagram: unknown sources s1(t), s2(t) → unknown mixture → observed signals x1(t), x2(t) (known) → ICA → estimation of the sources y1(t), y2(t)]

ICA needs:
1° a measure of independence
2° an algorithm

    Michel Verleysen Altran 18/11/2002 - 59

    Hypotheses on the model

• Mixture is linear:

$x_1(t) = a_{11} s_1(t) + a_{12} s_2(t)$
$x_2(t) = a_{21} s_1(t) + a_{22} s_2(t)$

• Other possible mixtures:
  - linear convolutive
  - non-linear
  - others (information on sources, etc.)

[same diagram as the previous slide]

    Michel Verleysen Altran 18/11/2002 - 60

    Definition of the model

• time series $s_j(t)$ → random variables $s_j$
• signals are zero-mean  (hypothesis!)

Here:
• linear mixture  $\mathbf{x} = A\, \mathbf{s}$
• # observations $x_i$ = # sources $s_j$  ⇒  A: square matrix
• If A were known:  $\mathbf{s} = A^{-1} \mathbf{x}$ …
• But A is unknown:  $\hat{\mathbf{s}} = \mathbf{y} = W \mathbf{x}$
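The linear model x = A s and the (here idealised) demixing y = W x written out for two sources; the example sources and the mixing matrix are arbitrary placeholders used only to show the shapes involved:

import numpy as np

time = np.linspace(0, 1, 500)
s = np.vstack([np.sin(2 * np.pi * 5 * time),                 # source 1 (zero-mean)
               np.sign(np.sin(2 * np.pi * 3 * time))])       # source 2 (zero-mean)

A = np.array([[1.0, 0.6],
              [0.5, 1.0]])     # square mixing matrix (unknown in the real problem)
x = A @ s                      # observations: the only signals ICA actually sees

W = np.linalg.inv(A)           # if A were known, W = A^(-1) recovers the sources exactly
y = W @ x
assert np.allclose(y, s)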

    Michel Verleysen Altran 18/11/2002 - 61

    Applications of ICA

• cocktail party
• biomedical signals
  - artifacts in EEG
  - ECG recordings from a pregnant woman
• extraction of features in natural images
• extraction of fault-related machine vibrations
• improving the prediction of a set of time series
• communications
• etc.

    Michel Verleysen Altran 18/11/2002 - 62

    Statistical independence

• Toss of 2 coins
  - coin 1: P(A = heads) = P(A = tails) = 1/2
  - coin 2: P(B = heads) = P(B = tails) = 1/2
  - each joint outcome: P(A, B) = 1/4
• Independence:  $P(A, B) = P(A)\, P(B)$

    Michel Verleysen Altran 18/11/2002 - 63

    Statistical independence

• continuous variables:  $p(a, b) = p_1(a)\, p_2(b)$

[figure: joint density p(a, b) and marginal densities p1(a), p2(b)]

    Michel Verleysen Altran 18/11/2002 - 64

    Statistical independence

• For continuous variables:  $p(a, b) = p_1(a)\, p_2(b)$
• Property:  $E\{a b\} = E\{a\}\, E\{b\}$,  and more generally  $E\{h_1(a)\, h_2(b)\} = E\{h_1(a)\}\, E\{h_2(b)\}$
• Independence ≠ uncorrelation
  - uncorrelation:  $E\{a b\} = E\{a\}\, E\{b\}$
  - example: four points with p(a, b) = 1/4 each (e.g. (0,1), (0,−1), (1,0), (−1,0)): a and b are uncorrelated, yet $E\{a^2 b^2\} = 0 \neq E\{a^2\}\, E\{b^2\}$, so they are not independent
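The uncorrelated-but-dependent example checked numerically; the four-point set {(0,1), (0,−1), (1,0), (−1,0)} with probability 1/4 each is the construction assumed above:

import numpy as np

pts = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)], dtype=float)   # p(a,b) = 1/4 for each point
a, b = pts[:, 0], pts[:, 1]

print(np.mean(a * b), np.mean(a) * np.mean(b))                    # 0.0 and 0.0: uncorrelated
print(np.mean(a**2 * b**2), np.mean(a**2) * np.mean(b**2))        # 0.0 vs 0.25: not independent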

    Michel Verleysen Altran 18/11/2002 - 65

    Illustration of ICA

Mixing matrix A (2×2)

[scatter plots: independent sources (s1, s2) and the mixed observations (x1, x2) obtained with A]

    Michel Verleysen Altran 18/11/2002 - 66

    ICA ↔ PCA

[scatter plots: observed data (x1, x2); PCA finds orthogonal axes y1, y2 of maximal variance; ICA finds the directions y1, y2 of the independent components]

    Michel Verleysen Altran 18/11/2002 - 67

    Ambiguities in ICA

• Amplitudes (variances, energies) of the sources:
  $\mathbf{x} = A\,\mathbf{s} = (A / k)(k\, \mathbf{s})$ for any scalar k  (the same holds for individual amplitudes)
  → we assume $E\{s_i^2\} = 1$  (an ambiguity on the sign remains)
• Order of the sources:
  $\mathbf{x} = A\,\mathbf{s} = (A P^{-1})(P \mathbf{s})$   (P: permutation matrix)

    Michel Verleysen Altran 18/11/2002 - 68

    Why Gaussian variables are forbidden ?

• If the sources $s_i$ are Gaussian and $A$ is orthogonal, the observations $x_i$ are Gaussian with

$p(x_1, x_2) = \frac{1}{2\pi} \exp\left(-\frac{x_1^2 + x_2^2}{2}\right)$

[plot: rotationally symmetric joint density — it carries no information about the mixing directions]

    Michel Verleysen Altran 18/11/2002 - 69

    « Nongaussian is independent »

• Central limit theorem: a sum of independent variables tends toward a Gaussian
• each estimate $y_i = \mathbf{w}_i^T \mathbf{x}$ is a weighted sum of the sources
  → $y_i$ is « more Gaussian » than $s_i$
  → $y_i$ is « least Gaussian » when $y_i = s_j$
  → find $\mathbf{w}_i$ which maximizes the nongaussianity of $\mathbf{w}_i^T \mathbf{x}$

[diagram: sources s → mixture (A) → observations x = A s → separation (W) → estimations y = W x; globally y = Z s with Z = W A]

    Michel Verleysen Altran 18/11/2002 - 70

A classical measure of nongaussianity: kurtosis

• $\mathrm{kurt}(y) = E\{y^4\} - 3\left(E\{y^2\}\right)^2$,   $\mathrm{kurt}(y_{\mathrm{Gauss}}) = 0$
• find w by gradient descent on |kurtosis|
• Problem: sensitivity to outliers!
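The kurtosis contrast written out directly from its definition; the sample signals used for comparison are assumptions for illustration:

import numpy as np

def kurt(y):
    # kurt(y) = E[y^4] - 3 (E[y^2])^2; zero for a Gaussian variable
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(0)
print(kurt(rng.normal(size=100_000)))        # ~ 0  (Gaussian)
print(kurt(rng.uniform(-1, 1, 100_000)))     # < 0  (sub-Gaussian)
print(kurt(rng.laplace(size=100_000)))       # > 0  (super-Gaussian)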

    Michel Verleysen Altran 18/11/2002 - 71

Another measure of nongaussianity: negentropy

• Entropy:  $H(Y) = -\sum_i P(Y = a_i) \log P(Y = a_i)$
• Differential entropy:  $H(y) = -\int f(y) \log f(y)\, \mathrm{d}y$
• For equal variances: entropy(Gaussian) > entropy(non-Gaussian)
  → negentropy $J(y) = H(y_{\mathrm{Gauss}}) - H(y) \ge 0$ measures nongaussianity
• Problem: computationally difficult (needs densities)

    Michel Verleysen Altran 18/11/2002 - 72

    Mutual information

• Natural measure of independence:

$I(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y})$

• Always positive, 0 if the $y_i$ are independent
• roughly equivalent to maximisation of negentropy

    Michel Verleysen Altran 18/11/2002 - 73

    Summary of contrast functions

• one-unit: kurtosis, negentropy, high-order cumulants
• multi-unit: mutual information, non-linear cross-correlations, maximum likelihood, non-linear PCA, high-order cumulants

    Michel Verleysen Altran 18/11/2002 - 74

Preprocessing
1. Centering
2. Whitening (PCA) (→ signals uncorrelated + unit variance)

[scatter plots: sources (s1, s2); mixture x = A s; whitened data x̃ = Ã s]

• Ã is orthogonal → n(n−1)/2 parameters instead of n²
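A sketch of the two preprocessing steps (centering, then PCA whitening) via the eigendecomposition of the covariance matrix; the Laplacian toy data and the mixing matrix are assumptions:

import numpy as np

def whiten(x):
    # x: n x samples array; returns whitened data (identity covariance) and the whitening matrix
    x = x - x.mean(axis=1, keepdims=True)            # 1. centering
    d, E = np.linalg.eigh(np.cov(x))                 # 2. eigen-decomposition of the covariance
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T          #    whitening matrix V = E D^(-1/2) E^T
    return V @ x, V

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.5], [0.2, 1.0]]) @ rng.laplace(size=(2, 10_000))
x_white, V = whiten(x)
print(np.cov(x_white).round(2))                      # ~ identity: uncorrelated, unit variance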

    Michel Verleysen Altran 18/11/2002 - 75

Algorithms for ICA

• Adaptive algorithms:  $W(t+1) = W(t) + \Delta W$
  - pioneering work, Hérault-Jutten:  $\Delta W_{ij} \propto g_1(y_i)\, g_2(y_j)$ for $i \neq j$
  - non-linear decorrelation:  $\Delta W \propto \left(I - g_1(\mathbf{y})\, g_2(\mathbf{y})^T\right) W$
  - maximisation of entropy:  $\Delta W \propto \left[W^T\right]^{-1} - 2 \tanh(W\mathbf{x})\, \mathbf{x}^T$
  - one-unit rules:  $\Delta \mathbf{w} \propto r\, \mathbf{x}\, g(\mathbf{w}^T \mathbf{x})$
• Batch algorithms
  - FastICA (fixed-point):  $\mathbf{w}(k) = E\left\{\mathbf{x}\, g\!\left(\mathbf{w}(k-1)^T \mathbf{x}\right)\right\} - E\left\{g'\!\left(\mathbf{w}(k-1)^T \mathbf{x}\right)\right\} \mathbf{w}(k-1)$
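A sketch of the one-unit FastICA fixed-point iteration quoted above, with g = tanh and a normalisation step, applied to whitened data; everything except the update formula itself (initialisation, iteration count, the whitening assumed as input) is an illustrative assumption:

import numpy as np

def fastica_one_unit(x_white, iters=100, seed=0):
    # x_white: n x samples array, already centered and whitened
    # fixed point: w <- E[x g(w^T x)] - E[g'(w^T x)] w, with g = tanh, then renormalise
    rng = np.random.default_rng(seed)
    w = rng.normal(size=x_white.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        wx = w @ x_white
        w_new = (x_white * np.tanh(wx)).mean(axis=1) - (1.0 - np.tanh(wx) ** 2).mean() * w
        w = w_new / np.linalg.norm(w_new)
    return w                                          # one estimated source: y = w^T x_white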

    Michel Verleysen Altran 18/11/2002 - 76

    Extensions of the ICA problem

• Basic model: linear x = A s
• Extensions to the linear model:
  - # observations > # sources
  - # sources > # observations
  - noisy observations
  - ill-conditioned mixtures
• Other extensions:
  - convolutive mixtures
  - non-linear mixtures

    Michel Verleysen Altran 18/11/2002 - 77

    Some realizations…

• ECG from pregnant women

[figure: observed sensor signals and estimated source signals (SIPCO, (Trieste), 1996)]

    J.-F. Cardoso, Multidimensional independent component analysis, Proc. of ICASSP’98, Seattle (USA), pp. 1941-1944.

    Michel Verleysen Altran 18/11/2002 - 78

    Some realizations…

• extraction of artefacts from MEG

R. Vigario, V. Jousmäki, M. Hämäläinen, R. Hari, E. Oja, Independent component analysis for identification of artifacts in magnetoencephalographic recordings, NIPS'97, pp. 229-235, MIT Press, 1998.

    Michel Verleysen Altran 18/11/2002 - 79

    Some realizations…

• Auditory evoked fields
• 122-channel MEG data

R. Vigario, J. Särelä, V. Jousmäki, E. Oja, Independent component analysis in decomposition of auditory and somatosensory evoked fields, Proc. of ICA'99, Aussois (France), January 1999, pp. 167-172.

    Michel Verleysen Altran 18/11/2002 - 80

    Some realizations…

A.J. Bell, T.J. Sejnowski, Edges are the 'Independent Components' of natural scenes, NIPS'96, pp. 831-836, MIT Press, 1996.

    Michel Verleysen Altran 18/11/2002 - 81

    Some realizations…

• Hands-free phone in car

    N. Charkani El Hassani, Séparation auto-adaptative de sources pour des mélanges convolutifs – Application à la téléphonie mains-libres dans les voitures, thèse de doctorat, INP Grenoble, 1996.

    Michel Verleysen Altran 18/11/2002 - 82

    Some realizations…

• Multiple RF tags

Y. Deville, J. Damour, N. Charkani, Improved multi-tag radio-frequency identification systems based on new source separation neural networks, Proc. of ICA'99, Aussois (France), January 1999, pp. 449-454.

    Michel Verleysen Altran 18/11/2002 - 83

    Some realizations…

• Financial time series

[figure: stock-return series, their independent components (ICA), and a reconstruction from 4 ICs]

    A.D. Back, A.S. Weigend, A First Application of Independent Component Analysis to Extracting Structure from Stock Returns, International Journal of Neural Systems, Vol. 8, No.5 (October, 1997)

    Michel Verleysen Altran 18/11/2002 - 84

    Other applications…

• Proceedings of the IEEE – special issue, October 1998
  - receivers (QAM signals)
  - equalization
  - CDMA transmission
  - television receivers
  - seismic pattern identification
• ICA'99 and ICA 2000 conferences
  - machine vibration analysis
  - astronomical images
  - …

    Michel Verleysen Altran 18/11/2002 - 85

    Cocktail party

• Speech – music separation  [audio demo: observations → estimations]
• Speech – speech separation  [audio demo: observations → estimations]

    T.-W. Lee, Institute for Neural Computation, University of California (San Diego), http://www.cnl.salk.edu/~tewon/Blind/blind_audio.html

    Michel Verleysen Altran 18/11/2002 - 86

    Some realizations…

• extraction of artefacts from EEG

R. Vigario, V. Jousmäki, M. Hämäläinen, R. Hari, E. Oja, Independent component analysis for identification of artifacts in magnetoencephalographic recordings, NIPS'97, pp. 229-235, MIT Press, 1998.

    Michel Verleysen Altran 18/11/2002 - 87

    Some realizations…

• extraction of artefacts from EEG

R. Vigario, V. Jousmäki, M. Hämäläinen, R. Hari, E. Oja, Independent component analysis for identification of artifacts in magnetoencephalographic recordings, NIPS'97, pp. 229-235, MIT Press, 1998.