
    Michel Verleysen Altran 18/11/2002 - 1

Artificial Neural Networks:
general overview and specific signal processing model

November 2002

    Michel Verleysen Altran 18/11/2002 - 2

    Content

• Part I: Introduction
  - why “artificial neural networks” (ANN)?
  - what are ANNs useful for?
  - learning – generalization – overfitting
• Part II: From linear to non-linear learning models
  - Adaline (linear model)
  - Multi-Layer Perceptron (MLP) (non-linear model)
  - stochastic gradient learning
  - learning, validation and test
• Part III: Blind Source Separation (BSS)
  - BSS model
  - statistical independence
  - some realizations


    Michel Verleysen Altran 18/11/2002 - 4

Why ANNs?

Von Neumann's computer         (Human) brain
determinism                    fuzzy behaviour
sequence of instructions       parallelism
high speed                     slow speed
repetitive tasks               adaptation to situations
programming                    learning
uniqueness of solutions        different solutions
ex: matrix product             ex: face recognition

    Michel Verleysen Altran 18/11/2002 - 5

    Artificial Neural Networks

• Artificial neural networks ARE NOT:
  - an attempt to understand biological systems
  - an attempt to reproduce the behavior of a biological system
• Artificial neural networks ARE:
  - a set of tools aimed at solving “perceptive” problems (“perceptive” as opposed to “rule-based”)

At least in this framework, i.e. “Neural Computation”…

    Michel Verleysen Altran 18/11/2002 - 6

    Why “Artificial neural networks” ?

• Structure: many computation units in parallel (McCulloch & Pitts, 1943)
• Learning rule (adaptation of parameters): similar to Hebb's rule (1949)
• And that's all…

[diagram: inputs → two layers of units → outputs]

$y_k = h\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, g\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$,  i.e.  $y(x) = h\left(w^{(2)}\, g\left(w^{(1)} x\right)\right)$
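To make the two-layer formula above concrete, here is a minimal forward-pass sketch in Python/NumPy. The layer sizes, the tanh/identity choices for g and h, and the random weights are illustrative assumptions, not part of the original slides.

import numpy as np

def forward(x, W1, W2, g=np.tanh, h=lambda a: a):
    # y = h(W2 . g(W1 . x)), with a constant bias unit appended at each layer
    x = np.append(x, 1.0)                  # x_0 = 1 (bias input)
    z = np.append(g(W1 @ x), 1.0)          # hidden outputs z_j = g(sum_i w_ji^(1) x_i), plus z_0 = 1
    return h(W2 @ z)                       # outputs y_k = h(sum_j w_kj^(2) z_j)

D, M, C = 3, 5, 2                          # inputs, hidden units, outputs (arbitrary sizes)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(M, D + 1))           # first-layer weights w_ji^(1), last column = bias
W2 = rng.normal(size=(C, M + 1))           # second-layer weights w_kj^(2), last column = bias
y = forward(rng.normal(size=D), W1, W2)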

    Michel Verleysen Altran 18/11/2002 - 7

    Learning

• Give - many - examples (input-output pairs, or training samples)
• Compute the parameters to fit the examples
• Test the result!

[diagram: inputs → network → outputs]

    Michel Verleysen Altran 18/11/2002 - 8

    Why Neural Computation ?

    p Hypothetical problem of optical character recognition (OCR)

• If: - image of 256 × 256 pixels
      - 8-bit pixel values (256 gray levels)
  then: about $10^{158000}$ different images
• Necessity to work with features!

    Michel Verleysen Altran 18/11/2002 - 9

    Feature Extraction

• Example of a feature in OCR: the height/width ratio (x1) of the character
• Histogram of feature values

[histogram of x1 for the “a” class and the “b” class, showing overlap]

    Michel Verleysen Altran 18/11/2002 - 10

    Multi-dimensional Features

• Necessity to classify according to several, but not too many, features
• Several: because of the overlap in the histogram with 1 feature
• Not too many: because classification methods are easier in low-dimensional spaces
• ANN: self-construction of features!

[scatter plot: classes C1 and C2 in the (x1, x2) feature plane]

    Michel Verleysen Altran 18/11/2002 - 11

    What are ANNs useful for ?

• regression
• classification
• adaptive filtering
• projection

[plot: regression example]

    Michel Verleysen Altran 18/11/2002 - 12

    What are ANNs useful for ?

• regression
• classification
• adaptive filtering
• projection

[plot: classification example]

    Michel Verleysen Altran 18/11/2002 - 13

    Classification - Regression

• regression: features (x1, x2) → output value y (continuous variable)
• classification: features (x1, x2) → class label C1 or C2 (discrete variable)

    Michel Verleysen Altran 18/11/2002 - 14

    What are ANNs useful for ?

• regression
• classification
• adaptive filtering
• projection

[plots: adaptive filtering example]

    Michel Verleysen Altran 18/11/2002 - 15

    What are ANNs useful for ?

• regression
• classification
• adaptive filtering
• projection

    Michel Verleysen Altran 18/11/2002 - 16

    Principle

[block diagram: inputs → non-linear parametric regression model → outputs; the outputs are compared with the desired outputs (error criterion) and the parameters are adapted accordingly]

    Michel Verleysen Altran 18/11/2002 - 17

    Principle

• Model: restrictions about the possible relation

[same block diagram as above]

    Michel Verleysen Altran 18/11/2002 - 18

    Principle

• Parametric: learning parameters to adjust (they contain the information)

[same block diagram as above]

    Michel Verleysen Altran 18/11/2002 - 19

    Principle

• Non-linear: more powerful than linear

[same block diagram as above]

    Michel Verleysen Altran 18/11/2002 - 20

    Principle

Universal approximator!

[same block diagram as above]

    Michel Verleysen Altran 18/11/2002 - 21

    Error criterion

• regression
• classification
• adaptive filtering
• projection

[diagram: inputs → model → outputs, compared with desired outputs]
[plot: regression example]

Criterion for regression: sum of squared output errors, $\sum_i \delta_i^2$

    Michel Verleysen Altran 18/11/2002 - 22

    Error criterion

• regression
• classification
• adaptive filtering
• projection

[diagram: inputs → model → outputs, compared with desired outputs]
[plot: classification example]

Criterion for classification: number of wrong classifications

    Michel Verleysen Altran 18/11/2002 - 23

    Error criterion

• regression
• classification
• adaptive filtering
• projection

[diagram: inputs → model → outputs, compared with desired outputs]
[plots: four signal waveforms]

Criterion: independence between outputs

    Michel Verleysen Altran 18/11/2002 - 24

    Error criterion

• regression
• classification
• adaptive filtering
• projection

[diagram: inputs → model → outputs, compared with desired outputs]

Criterion for projection: independence between outputs

    Michel Verleysen Altran 18/11/2002 - 25

    Polynomial Curve Fitting

• Strong similarity between polynomial curve fitting and regression with neural networks
• polynomials:  $y = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
• neural networks:  $y = f(\mathbf{x}, \mathbf{w})$
• Error criterion:  $E = \sum_{p=1}^{P} \left(y^p - t^p\right)^2$  (t: desired output, or target)
• Polynomials: E is quadratic in w → min(E) is the solution of a set of linear equations
• Neural networks: E is non-linear in w → local minima
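Because E is quadratic in w for the polynomial model, the minimum is obtained by solving a linear system. A small NumPy sketch (the data, noise level and degree M are made-up illustrations):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 30)                                          # toy inputs
t = 1.0 + 0.5 * np.sin(1.5 * x) + 0.1 * rng.normal(size=x.size)    # noisy targets

M = 3                                                  # polynomial degree (assumption)
X = np.vander(x, M + 1, increasing=True)               # columns 1, x, x^2, ..., x^M
w, *_ = np.linalg.lstsq(X, t, rcond=None)              # linear least-squares solution
y = X @ w                                              # fitted values y^p = sum_j w_j (x^p)^j

For a neural network the same criterion E is non-linear in w, so no such direct solution exists and iterative (gradient) learning is used instead.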

    Michel Verleysen Altran 18/11/2002 - 26

    Learning - generalization

• Learning: process of
  - finding the parameters w which
  - minimize the error function E
  - given a set of input-output pairs (x^p, t^p)

[plots: training pairs (x, t) and the curve fitted by adjusting w]

    Michel Verleysen Altran 18/11/2002 - 27

Learning - generalization

• generalization: process of
  - assigning an output value y to
  - a new input x
  - given the parameter vector w

[plots: the fitted curve, parameterized by w, evaluated at a new input x]

    Michel Verleysen Altran 18/11/2002 - 28

    Overfitting

• Compromise between
  - model complexity and
  - number of learning pairs
• Neural networks are
  - universal approximation models with
  - a (relatively) low complexity
• But overfitting must be a serious concern!

[plots: the same data fitted by a smooth curve and by an overfitted curve]

    Michel Verleysen Altran 18/11/2002 - 29

    Overfitting in classification

[figures: decision boundaries between classes C1 and C2 in the (x1, x2) plane, from simple to overfitted]

    Michel Verleysen Altran 18/11/2002 - 30

    The Curse of Dimensionality

• Histograms: the number of boxes is exponential with the dimension
• More features → reduced performances
• If # data is limited → dimension must be low
• Concept of intrinsic dimensionality of data

    Michel Verleysen Altran 18/11/2002 - 31

    Content

• Part I: Introduction
  - why “artificial neural networks” (ANN)?
  - what are ANNs useful for?
  - learning – generalization – overfitting
• Part II: From linear to non-linear learning models
  - Adaline (linear model)
  - Multi-Layer Perceptron (MLP) (non-linear model)
  - stochastic gradient learning
  - learning, validation and test
• Part III: Blind Source Separation (BSS)
  - BSS model
  - statistical independence
  - some realizations

    Michel Verleysen Altran 18/11/2002 - 32

From linear to non-linear regression

• Linear regression
  + simpler
  + easy « learning » (direct or adaptive)
  − not sufficient for most problems
• Non-linear regression
  − more complex
  = adaptive learning only
  + may be adapted to most problems

    Michel Verleysen Altran 18/11/2002 - 33

    Adaline

• Adaline: ADAptive LINear Element
• = linear classification, linear discriminant functions, ...

$y = \mathbf{w}^T \mathbf{x}$,  with  $\mathbf{x} = (1, x_1, \ldots, x_D)^T$,  $\mathbf{w} = (w_0, w_1, \ldots, w_D)^T$  ($w_0$: bias)

[diagram: inputs $1, x_1, \ldots, x_D$, weights $w_0, w_1, \ldots, w_D$, single linear output y]

    Michel Verleysen Altran 18/11/2002 - 34

    Adaline: learning

• Adaline: one linear output
• patterns (learning vectors) must satisfy  $t^p = \mathbf{w}^T \mathbf{x}^p = \sum_{i=1}^{D} w_i x_i^p$
• but P > D
  - P patterns
  - D parameters (degrees of freedom)
• → non-ideal solution
• → optimisation (of parameters w) according to a criterion:

$E = \sum_{p=1}^{P} \left(y^p - t^p\right)^2 = \sum_{p=1}^{P} \left(\mathbf{w}^T \mathbf{x}^p - t^p\right)^2$

    Michel Verleysen Altran 18/11/2002 - 35

    Adaline: pseudo-inverse

• Error criterion  $E = \sum_{p=1}^{P} \left(y^p - t^p\right)^2 = \sum_{p=1}^{P} \left(\mathbf{w}^T \mathbf{x}^p - t^p\right)^2$
• Inputs $\mathbf{x}^p$ gathered in a matrix and targets $t^p$ in a vector:

$X = \left(\mathbf{x}^1\; \mathbf{x}^2\; \cdots\; \mathbf{x}^P\right) = \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^P \\ x_2^1 & x_2^2 & \cdots & x_2^P \\ \vdots & \vdots & & \vdots \\ x_D^1 & x_D^2 & \cdots & x_D^P \end{pmatrix}$,   $\mathbf{t}^T = \left(t^1\; t^2\; \cdots\; t^P\right)$

• Error criterion  $E = \left\| \mathbf{t}^T - \mathbf{w}^T X \right\|^2$

    Michel Verleysen Altran 18/11/2002 - 36

    Adaline: pseudo-inverse

• Error criterion  $E = \left\| \mathbf{t}^T - \mathbf{w}^T X \right\|^2$,  where $\mathbf{x}_j = \left(x_j^1\; x_j^2\; \cdots\; x_j^P\right)$ is the j-th row of X
• Gradient of the error (function to minimize) with respect to the weights (free parameters):

$\frac{\partial E}{\partial \mathbf{w}^T} \equiv \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_D}\right)$

$\frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j} \left\| \mathbf{t}^T - \mathbf{w}^T X \right\|^2 = \frac{\partial}{\partial w_j} \left(\mathbf{t}^T - \mathbf{w}^T X\right)\left(\mathbf{t} - X^T \mathbf{w}\right) = 2\left(\mathbf{w}^T X - \mathbf{t}^T\right)\mathbf{x}_j^T$

    Michel Verleysen Altran 18/11/2002 - 37

    Adaline: pseudo-inverse

• Criterion:  $E = \left\| \mathbf{t}^T - \mathbf{w}^T X \right\|^2$
• Derivative of criterion:  $\frac{\partial E}{\partial \mathbf{w}^T} = 2\left(\mathbf{w}^T X - \mathbf{t}^T\right) X^T$
• Minimum of error:  $\frac{\partial E}{\partial \mathbf{w}^T} = 0 \;\Rightarrow\; \mathbf{w} = \left(X X^T\right)^{-1} X\, \mathbf{t}$  (pseudo-inverse solution)
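A quick NumPy check of the pseudo-inverse solution, with X stored as a D×P matrix of patterns as on the slide (the data here are random placeholders):

import numpy as np

rng = np.random.default_rng(0)
D, P = 4, 100
X = rng.normal(size=(D, P))                 # columns are the patterns x^p
t = rng.normal(size=P)                      # targets t^p

w = np.linalg.solve(X @ X.T, X @ t)         # w = (X X^T)^(-1) X t
w_check = np.linalg.pinv(X.T) @ t           # same solution via the pseudo-inverse of X^T
assert np.allclose(w, w_check)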

    Michel Verleysen Altran 18/11/2002 - 38

    Adaline: iterative learning

• Pseudo-inverse requires:
  - all input-output pairs (x^p, t^p)
  - matrix inversion (often ill-conditioned)
• necessity for iterative methods without matrix inversion
• → Gradient descent!

    Michel Verleysen Altran 18/11/2002 - 39

    Adaline: gradient descent

• function to minimise: E
• parameters: w
• Gradient descent on $E = \sum_{p=1}^{P} \left(t^p - \mathbf{w}^T \mathbf{x}^p\right)^2 = \sum_{p=1}^{P} E^p$:

$\mathbf{w}(t+1) = \mathbf{w}(t) - \alpha \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)} = \mathbf{w}(t) + 2\alpha\, X\left(\mathbf{t} - X^T \mathbf{w}(t)\right)$

• Stochastic gradient: use one input-output pair at a time!

$\mathbf{w}(t+1) = \mathbf{w}(t) - \alpha \left.\frac{\partial E^p}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)} = \mathbf{w}(t) + 2\alpha \left(t^k - \mathbf{w}(t)^T \mathbf{x}^k\right)\mathbf{x}^k$

• pseudo-inverse, gradient descent and stochastic gradient descent: same error criterion → same solution!
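A sketch of the stochastic gradient rule for the Adaline; the learning rate, number of passes and random shuffling are assumptions for illustration:

import numpy as np

def adaline_sgd(X, t, alpha=0.01, epochs=50, seed=0):
    # X: D x P matrix of patterns, t: P targets; returns the weight vector w
    D, P = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(D)
    for _ in range(epochs):
        for k in rng.permutation(P):              # one input-output pair at a time
            err = t[k] - w @ X[:, k]
            w += 2 * alpha * err * X[:, k]        # w <- w + 2 alpha (t^k - w^T x^k) x^k
    return w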

    Michel Verleysen Altran 18/11/2002 - 40

    Multi-layer perceptron (MLP)

• Convention: 2 layers of weights (in the literature: sometimes 3 layers of units, or neurons)
• g and h can be threshold (sign) units or continuous ones
• h can be linear, but not g (otherwise only one layer)

[diagram: inputs $x_0$ (bias), $x_1, \ldots, x_D$; 1st layer of weights $w_{ji}^{(1)}$; hidden units $z_0$ (bias), $z_1, \ldots, z_M$; 2nd layer of weights $w_{kj}^{(2)}$; outputs $y_1, \ldots, y_C$]

$y_k = h\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, g\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$,  i.e.  $y(x) = h\left(w^{(2)}\, g\left(w^{(1)} x\right)\right)$

    Michel Verleysen Altran 18/11/2002 - 41

    Error back-propagation 1/

• Online (stochastic) algorithm considered here
• Some notations:

[diagram: weight $w_{ij}^{(l)}$ connects the output $z_j^{(l-1)}$ of layer l−1 to unit i of layer l, which has activation $a_i^{(l)}$ and output $z_i^{(l)} = g\left(a_i^{(l)}\right)$]

$a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}$

    Michel Verleysen Altran 18/11/2002 - 42

    Error back-propagation 2/

• Gradient descent: value of $\frac{\partial E}{\partial w_{ij}^{(l)}}$ to evaluate, with $a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)}$

$\frac{\partial E}{\partial w_{ij}^{(l)}} = \frac{\partial E}{\partial a_i^{(l)}} \frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}} = \delta_i^{(l)}\, z_j^{(l-1)}$,   where $\delta_i^{(l)} \equiv \frac{\partial E}{\partial a_i^{(l)}}$ remains to be evaluated

    Michel Verleysen Altran 18/11/2002 - 43

    Error back-propagation 3/

• For output units ($\delta_i^{(l)} \equiv \frac{\partial E}{\partial a_i^{(l)}}$ to evaluate):

$\delta_i^{(l)} = \frac{\partial E}{\partial a_i^{(l)}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial a_i^{(l)}} = \frac{\partial E}{\partial y_i}\, g'\left(a_i^{(l)}\right)$

  - $\frac{\partial E}{\partial y_i}$: derivative of the error criterion (known)
  - $g'\left(a_i^{(l)}\right)$: known

    Michel Verleysen Altran 18/11/2002 - 44

    Error back-propagation 4/

• For hidden units ($\delta_i^{(l)} \equiv \frac{\partial E}{\partial a_i^{(l)}}$ to evaluate):

$\delta_i^{(l)} = \frac{\partial E}{\partial a_i^{(l)}} = \sum_k \frac{\partial E}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = \sum_k \delta_k^{(l+1)} \frac{\partial}{\partial a_i^{(l)}} \left(\sum_j w_{kj}^{(l+1)}\, g\left(a_j^{(l)}\right)\right) = g'\left(a_i^{(l)}\right) \sum_k w_{ki}^{(l+1)}\, \delta_k^{(l+1)}$

• The error term (δ) is expressed as a combination of the error terms of the next layer

    Michel Verleysen Altran 18/11/2002 - 45

    Error back-propagation 5/

• Apply an input vector $\mathbf{x}^k$ and propagate it through the network to evaluate all activations $a_i^{(l)}$ and neuron outputs $z_i^{(l)}$
• Evaluate the error terms $\delta_i^{(o)}$ in the output layer
• Back-propagate the error terms $\delta_i^{(l)}$ to find the error terms $\delta_i^{(l-1)}$
• Evaluate all derivatives  $\frac{\partial E}{\partial w_{ij}^{(l)}} = \delta_i^{(l)}\, z_j^{(l-1)}$
• Adjust the weights according to the derivatives and a gradient descent scheme (see the sketch below)
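The five steps above, written out for one hidden layer with tanh hidden units, linear outputs and a squared-error criterion; this is a hedged sketch (layer sizes, activation choices and learning rate are assumptions), not the exact algorithm variant of the slides:

import numpy as np

def backprop_step(x, t, W1, W2, alpha=0.01):
    # One stochastic gradient step for a 1-hidden-layer MLP; W1, W2 are updated in place.
    # 1. forward pass: activations and unit outputs
    x1 = np.append(x, 1.0)                     # input plus bias
    a1 = W1 @ x1                               # hidden activations
    z1 = np.append(np.tanh(a1), 1.0)           # hidden outputs plus bias
    y = W2 @ z1                                # linear output layer
    # 2. error terms in the output layer (E = sum (y - t)^2, so dE/dy = 2 (y - t), h' = 1)
    delta2 = 2.0 * (y - t)
    # 3. back-propagate: delta1 = g'(a1) * sum_k w_ki^(2) delta2_k
    delta1 = (1.0 - np.tanh(a1) ** 2) * (W2[:, :-1].T @ delta2)
    # 4. all derivatives dE/dw_ij = delta_i * z_j of the previous layer
    gW1 = np.outer(delta1, x1)
    gW2 = np.outer(delta2, z1)
    # 5. gradient descent update
    W1 -= alpha * gW1
    W2 -= alpha * gW2
    return np.sum((y - t) ** 2)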

    Michel Verleysen Altran 18/11/2002 - 46

    Back-propagation : weight adjustment 1/

• Now that we have the derivatives $\frac{\partial E}{\partial w_{ij}^{(l)}}$, how do we adjust the weights according to these derivatives?
• Notation:  $\delta \mathbf{w} \equiv \mathbf{w}(t+1) - \mathbf{w}(t)$

    Michel Verleysen Altran 18/11/2002 - 47

    Back-propagation : weight adjustment 2/

• First-order methods:

$E\left(\mathbf{w}(t+1)\right) = E\left(\mathbf{w}(t) + \delta\mathbf{w}\right) \approx E\left(\mathbf{w}(t)\right) + \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)}^{T} \left(\mathbf{w}(t+1) - \mathbf{w}(t)\right) = E\left(\mathbf{w}(t)\right) + \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)}^{T} \delta\mathbf{w}$

• For a δw of a given magnitude, the largest δE is found when the gradient and the weight update vectors are parallel
• Adaptation rule:  $\mathbf{w}(t+1) = \mathbf{w}(t) - \alpha \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)}$

    Michel Verleysen Altran 18/11/2002 - 48

    Back-propagation : weight adjustment 4/

• Improvements: momentum
• to avoid sharp changes in gradient direction (stochastic scheme, outliers, etc.)

$\mathbf{w}(t+1) = \mathbf{w}(t) - \alpha \left.\frac{\partial E}{\partial \mathbf{w}}\right|_{\mathbf{w}(t)} + \beta\left(\mathbf{w}(t) - \mathbf{w}(t-1)\right)$,  with $\beta \approx 0.9$
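The momentum rule as a one-line modification of the plain gradient step; the toy quadratic error used to exercise it is an assumption:

import numpy as np

def momentum_step(w, w_prev, grad, alpha=0.01, beta=0.9):
    # w(t+1) = w(t) - alpha * dE/dw + beta * (w(t) - w(t-1))
    return w - alpha * grad + beta * (w - w_prev)

w_prev = np.array([1.0, 1.0])
w = np.array([0.9, 0.9])
for _ in range(200):                       # E(w) = ||w||^2, gradient 2w
    w, w_prev = momentum_step(w, w_prev, 2 * w), w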

    Michel Verleysen Altran 18/11/2002 - 49

    Back-propagation : weight adjustment 5/

• Improvements: adaptive learning rate α
  - in classification problems: if underrepresented classes
  - maximum safe step size
• “risk-taking” algorithms: individual learning rates (this is more or less the “delta-bar-delta” rule):

$w_{ij}(t+1) = w_{ij}(t) - \alpha_{ij}(t) \left.\frac{\partial E}{\partial w_{ij}}\right|_{\mathbf{w}(t)}$

$\delta w_{ij}(t+1)\, \delta w_{ij}(t) > 0 \;\Rightarrow\; \alpha_{ij}(t+1) = \alpha_{ij}(t) + \kappa$
$\delta w_{ij}(t+1)\, \delta w_{ij}(t) < 0 \;\Rightarrow\; \alpha_{ij}(t+1) = \alpha_{ij}(t) - \kappa$
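A sketch of per-weight adaptive learning rates along the lines of the rule above (additive increase/decrease by κ when successive updates agree/disagree in sign); the lower bound on α and the toy error are assumptions:

import numpy as np

def adaptive_lr_step(w, alphas, grad, dw_prev, kappa=0.001):
    dw = -alphas * grad                                        # per-weight update
    agree = dw * dw_prev > 0                                   # same sign as previous update?
    alphas = np.where(agree, alphas + kappa, np.maximum(alphas - kappa, 1e-6))
    return w + dw, alphas, dw

w, alphas, dw_prev = np.array([1.0, -2.0]), np.full(2, 0.01), np.zeros(2)
for _ in range(200):                                           # E(w) = ||w||^2, gradient 2w
    w, alphas, dw_prev = adaptive_lr_step(w, alphas, 2 * w, dw_prev)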

    Michel Verleysen Altran 18/11/2002 - 51

    Validation and test

• Basic principle: never test on learning data

[diagram: Data split into a Learning set and a Test set]

    Michel Verleysen Altran 18/11/2002 - 52

    Validation and test

• Basic principle: never test on learning data
• Corollary: never compare models on learning data

[diagram: Data split into a Learning set (N1), a Validation set (N2) and a Test set (N3)]

    Michel Verleysen Altran 18/11/2002 - 53

    Validation and test

• Basic principle: never test on learning data
• Corollary: never compare models on learning data
• Learning set: used to learn each model
• Validation set: used to compare models and select one of them
• Test set: used to estimate the error made by the selected model

[diagram: Data split into a Learning set (N1), a Validation set (N2) and a Test set (N3)]

    Michel Verleysen Altran 18/11/2002 - 54

    Validation and test

• Each model depends on the (sub)set of data used
  → several learning runs for each model
  → validation errors are averaged
• Average validation errors are compared (between models)
• The best model is selected
• Its error is estimated on the test set
• To select the subsets of data: several ways
  - (k-fold) cross-validation
  - bootstrap
• Particularly important with small sets of high-dimensional data!
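A minimal NumPy-only sketch of the learning / validation / test protocol; the split sizes and the two candidate models (polynomial degrees 1 and 3) are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 120)
t = 1 + 0.5 * np.sin(1.5 * x) + 0.1 * rng.normal(size=x.size)

idx = rng.permutation(x.size)
learn, val, test = idx[:60], idx[60:90], idx[90:]          # N1 / N2 / N3 subsets

def err(w, i):                                             # mean squared error on subset i
    return np.mean((np.polyval(w, x[i]) - t[i]) ** 2)

models = {deg: np.polyfit(x[learn], t[learn], deg) for deg in (1, 3)}   # learning set: fit each model
best = min(models, key=lambda d: err(models[d], val))                   # validation set: compare and select
test_error = err(models[best], test)                                    # test set: estimate the final error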

    Michel Verleysen Altran 18/11/2002 - 55

    Content

• Part I: Introduction
  - why “artificial neural networks” (ANN)?
  - what are ANNs useful for?
  - learning – generalization – overfitting
• Part II: From linear to non-linear learning models
  - Adaline (linear model)
  - Multi-Layer Perceptron (MLP) (non-linear model)
  - stochastic gradient learning
  - learning, validation and test
• Part III: Blind Source Separation (BSS)
  - BSS model
  - statistical independence
  - some realizations

    Michel Verleysen Altran 18/11/2002 - 56

    Cocktail party


    Michel Verleysen Altran 18/11/2002 - 57

    Blind Source Separation (BSS) orIndependent Component Analysis (ICA)

• The sources are unknown (but independent)
• The mixture is unknown
• The observations are known

    Michel Verleysen Altran 18/11/2002 - 58

    ICA principle

[plots: two unknown source signals s1(t), s2(t), the two observed mixtures x1(t), x2(t), and the two estimated sources y1(t), y2(t)]

[diagram: unknown sources s1(t), s2(t) → unknown mixture → observed signals x1(t), x2(t) (known) → ICA → estimation of the sources y1(t), y2(t)]

ICA needs:
1° a measure of independence
2° an algorithm

    Michel Verleysen Altran 18/11/2002 - 59

    Hypotheses on the model

• Mixture is linear:

$x_1(t) = a_{11} s_1(t) + a_{12} s_2(t)$
$x_2(t) = a_{21} s_1(t) + a_{22} s_2(t)$

• Other possible mixtures:
  - linear convolutive
  - non-linear
  - others (information on sources, etc.)

[same diagram as the previous slide]

    Michel Verleysen Altran 18/11/2002 - 60

    Definition of the model

• time series $s_j(t)$ → random variables $s_j$
• signals are zero-mean  (hypothesis!)

Here:
• linear mixture  $\mathbf{x} = A\, \mathbf{s}$
• # observations $x_i$ = # sources $s_j$  ⇒  A: square matrix
• If A were known:  $\mathbf{s} = A^{-1} \mathbf{x}$ …
• But A is unknown:  $\hat{\mathbf{s}} = \mathbf{y} = W \mathbf{x}$
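The linear model x = A s and the (here idealised) demixing y = W x written out for two sources; the example sources and the mixing matrix are arbitrary placeholders used only to show the shapes involved:

import numpy as np

time = np.linspace(0, 1, 500)
s = np.vstack([np.sin(2 * np.pi * 5 * time),                 # source 1 (zero-mean)
               np.sign(np.sin(2 * np.pi * 3 * time))])       # source 2 (zero-mean)

A = np.array([[1.0, 0.6],
              [0.5, 1.0]])     # square mixing matrix (unknown in the real problem)
x = A @ s                      # observations: the only signals ICA actually sees

W = np.linalg.inv(A)           # if A were known, W = A^(-1) recovers the sources exactly
y = W @ x
assert np.allclose(y, s)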

    Michel Verleysen Altran 18/11/2002 - 61

    Applications of ICA

• cocktail party
• biomedical signals
  - artifacts in EEG
  - ECG recordings from a pregnant woman
• extraction of features in natural images
• extraction of fault-related machine vibrations
• improving the prediction of a set of time series
• communications
• etc.

    Michel Verleysen Altran 18/11/2002 - 62

    Statistical independence

• Toss of 2 coins
  - coin 1: P(A = heads) = P(A = tails) = 1/2
  - coin 2: P(B = heads) = P(B = tails) = 1/2
  - each joint outcome: P(A, B) = 1/4
• Independence:  $P(A, B) = P(A)\, P(B)$

    Michel Verleysen Altran 18/11/2002 - 63

    Statistical independence

• continuous variables:  $p(a, b) = p_1(a)\, p_2(b)$

[figure: joint density p(a, b) and marginal densities p1(a), p2(b)]

    Michel Verleysen Altran 18/11/2002 - 64

    Statistical independence

• For continuous variables:  $p(a, b) = p_1(a)\, p_2(b)$
• Property:  $E\{a b\} = E\{a\}\, E\{b\}$,  and more generally  $E\{h_1(a)\, h_2(b)\} = E\{h_1(a)\}\, E\{h_2(b)\}$
• Independence ≠ uncorrelation
  - uncorrelation:  $E\{a b\} = E\{a\}\, E\{b\}$
  - example: four points with p(a, b) = 1/4 each (e.g. (0,1), (0,−1), (1,0), (−1,0)): a and b are uncorrelated, yet $E\{a^2 b^2\} = 0 \neq E\{a^2\}\, E\{b^2\}$, so they are not independent
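The uncorrelated-but-dependent example checked numerically; the four-point set {(0,1), (0,−1), (1,0), (−1,0)} with probability 1/4 each is the construction assumed above:

import numpy as np

pts = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)], dtype=float)   # p(a,b) = 1/4 for each point
a, b = pts[:, 0], pts[:, 1]

print(np.mean(a * b), np.mean(a) * np.mean(b))                    # 0.0 and 0.0: uncorrelated
print(np.mean(a**2 * b**2), np.mean(a**2) * np.mean(b**2))        # 0.0 vs 0.25: not independent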

    Michel Verleysen Altran 18/11/2002 - 65

    Illustration of ICA

Mixing matrix A (2×2)

[scatter plots: independent sources (s1, s2) and the mixed observations (x1, x2) obtained with A]

    Michel Verleysen Altran 18/11/2002 - 66

    ICA ↔ PCA

[scatter plots: observed data (x1, x2); PCA finds orthogonal axes y1, y2 of maximal variance; ICA finds the directions y1, y2 of the independent components]

    Michel Verleysen Altran 18/11/2002 - 67

    Ambiguities in ICA

• Amplitudes (variances, energies) of the sources:
  $\mathbf{x} = A\,\mathbf{s} = (A / k)(k\, \mathbf{s})$ for any scalar k  (the same holds for individual amplitudes)
  → we assume $E\{s_i^2\} = 1$  (an ambiguity on the sign remains)
• Order of the sources:
  $\mathbf{x} = A\,\mathbf{s} = (A P^{-1})(P \mathbf{s})$   (P: permutation matrix)

    Michel Verleysen Altran 18/11/2002 - 68

    Why Gaussian variables are forbidden ?

• If the sources $s_i$ are Gaussian and $A$ is orthogonal, the observations $x_i$ are Gaussian with

$p(x_1, x_2) = \frac{1}{2\pi} \exp\left(-\frac{x_1^2 + x_2^2}{2}\right)$

[plot: rotationally symmetric joint density — it carries no information about the mixing directions]

    Michel Verleysen Altran 18/11/2002 - 69

    « Nongaussian is independent »

• Central limit theorem: a sum of independent variables tends toward a Gaussian
• each estimate $y_i = \mathbf{w}_i^T \mathbf{x}$ is a weighted sum of the sources
  → $y_i$ is « more Gaussian » than $s_i$
  → $y_i$ is « least Gaussian » when $y_i = s_j$
  → find $\mathbf{w}_i$ which maximizes the nongaussianity of $\mathbf{w}_i^T \mathbf{x}$

[diagram: sources s → mixture (A) → observations x = A s → separation (W) → estimations y = W x; globally y = Z s with Z = W A]

    Michel Verleysen Altran 18/11/2002 - 70

A classical measure of nongaussianity: kurtosis

• $\mathrm{kurt}(y) = E\{y^4\} - 3\left(E\{y^2\}\right)^2$,   $\mathrm{kurt}(y_{\mathrm{Gauss}}) = 0$
• find w by gradient descent on |kurtosis|
• Problem: sensitivity to outliers!
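The kurtosis contrast written out directly from its definition; the sample signals used for comparison are assumptions for illustration:

import numpy as np

def kurt(y):
    # kurt(y) = E[y^4] - 3 (E[y^2])^2; zero for a Gaussian variable
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(0)
print(kurt(rng.normal(size=100_000)))        # ~ 0  (Gaussian)
print(kurt(rng.uniform(-1, 1, 100_000)))     # < 0  (sub-Gaussian)
print(kurt(rng.laplace(size=100_000)))       # > 0  (super-Gaussian)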

    Michel Verleysen Altran 18/11/2002 - 71

Another measure of nongaussianity: negentropy

• Entropy:  $H(Y) = -\sum_i P(Y = a_i) \log P(Y = a_i)$
• Differential entropy:  $H(y) = -\int f(y) \log f(y)\, \mathrm{d}y$
• For equal variances: entropy(Gaussian) > entropy(non-Gaussian)
  → negentropy $J(y) = H(y_{\mathrm{Gauss}}) - H(y) \ge 0$ measures nongaussianity
• Problem: computationally difficult (needs densities)

    Michel Verleysen Altran 18/11/2002 - 72

    Mutual information

• Natural measure of independence:

$I(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y})$

• Always positive, 0 if the $y_i$ are independent
• roughly equivalent to maximisation of negentropy

    Michel Verleysen Altran 18/11/2002 - 73

    Summary of contrast functions

• one-unit: kurtosis, negentropy, high-order cumulants
• multi-unit: mutual information, non-linear cross-correlations, maximum likelihood, non-linear PCA, high-order cumulants

    Michel Verleysen Altran 18/11/2002 - 74

Preprocessing
1. Centering
2. Whitening (PCA) (→ signals uncorrelated + unit variance)

[scatter plots: sources (s1, s2); mixture x = A s; whitened data x̃ = Ã s]

• Ã is orthogonal → n(n−1)/2 parameters instead of n²
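A sketch of the two preprocessing steps (centering, then PCA whitening) via the eigendecomposition of the covariance matrix; the Laplacian toy data and the mixing matrix are assumptions:

import numpy as np

def whiten(x):
    # x: n x samples array; returns whitened data (identity covariance) and the whitening matrix
    x = x - x.mean(axis=1, keepdims=True)            # 1. centering
    d, E = np.linalg.eigh(np.cov(x))                 # 2. eigen-decomposition of the covariance
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T          #    whitening matrix V = E D^(-1/2) E^T
    return V @ x, V

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.5], [0.2, 1.0]]) @ rng.laplace(size=(2, 10_000))
x_white, V = whiten(x)
print(np.cov(x_white).round(2))                      # ~ identity: uncorrelated, unit variance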

    Michel Verleysen Altran 18/11/2002 - 75

Algorithms for ICA

• Adaptive algorithms:  $W(t+1) = W(t) + \Delta W$
  - pioneering work, Hérault-Jutten:  $\Delta W_{ij} \propto g_1(y_i)\, g_2(y_j)$ for $i \neq j$
  - non-linear decorrelation:  $\Delta W \propto \left(I - g_1(\mathbf{y})\, g_2(\mathbf{y})^T\right) W$
  - maximisation of entropy:  $\Delta W \propto \left[W^T\right]^{-1} - 2 \tanh(W\mathbf{x})\, \mathbf{x}^T$
  - one-unit rules:  $\Delta \mathbf{w} \propto r\, \mathbf{x}\, g(\mathbf{w}^T \mathbf{x})$
• Batch algorithms
  - FastICA (fixed-point):  $\mathbf{w}(k) = E\left\{\mathbf{x}\, g\!\left(\mathbf{w}(k-1)^T \mathbf{x}\right)\right\} - E\left\{g'\!\left(\mathbf{w}(k-1)^T \mathbf{x}\right)\right\} \mathbf{w}(k-1)$
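A sketch of the one-unit FastICA fixed-point iteration quoted above, with g = tanh and a normalisation step, applied to whitened data; everything except the update formula itself (initialisation, iteration count, the whitening assumed as input) is an illustrative assumption:

import numpy as np

def fastica_one_unit(x_white, iters=100, seed=0):
    # x_white: n x samples array, already centered and whitened
    # fixed point: w <- E[x g(w^T x)] - E[g'(w^T x)] w, with g = tanh, then renormalise
    rng = np.random.default_rng(seed)
    w = rng.normal(size=x_white.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        wx = w @ x_white
        w_new = (x_white * np.tanh(wx)).mean(axis=1) - (1.0 - np.tanh(wx) ** 2).mean() * w
        w = w_new / np.linalg.norm(w_new)
    return w                                          # one estimated source: y = w^T x_white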

    Michel Verleysen Altran 18/11/2002 - 76

    Extensions of the ICA problem

• Basic model: linear x = A s
• Extensions to the linear model:
  - # observations > # sources
  - # sources > # observations
  - noisy observations
  - ill-conditioned mixtures
• Other extensions:
  - convolutive mixtures
  - non-linear mixtures

    Michel Verleysen Altran 18/11/2002 - 77

    Some realizations…

• ECG from pregnant women

[figure: observed sensor signals and estimated source signals (SIPCO, (Trieste), 1996)]

    J.-F. Cardoso, Multidimensional independent component analysis, Proc. of ICASSP’98, Seattle (USA), pp. 1941-1944.

    Michel Verleysen Altran 18/11/2002 - 78

    Some realizations…

• extraction of artefacts from MEG

R. Vigario, V. Jousmäki, M. Hämäläinen, R. Hari, E. Oja, Independent component analysis for identification of artifacts in magnetoencephalographic recordings, NIPS'97, pp. 229-235, MIT Press, 1998.

    Michel Verleysen Altran 18/11/2002 - 79

    Some realizations…

• Auditory evoked fields
• 122-channel MEG data

R. Vigario, J. Särelä, V. Jousmäki, E. Oja, Independent component analysis in decomposition of auditory and somatosensory evoked fields, Proc. of ICA'99, Aussois (France), January 1999, pp. 167-172.

    Michel Verleysen Altran 18/11/2002 - 80

    Some realizations…

A.J. Bell, T.J. Sejnowski, Edges are the 'Independent Components' of natural scenes, NIPS'96, pp. 831-836, MIT Press, 1996.

    Michel Verleysen Altran 18/11/2002 - 81

    Some realizations…

• Hands-free phone in car

    N. Charkani El Hassani, Séparation auto-adaptative de sources pour des mélanges convolutifs – Application à la téléphonie mains-libres dans les voitures, thèse de doctorat, INP Grenoble, 1996.

    Michel Verleysen Altran 18/11/2002 - 82

    Some realizations…

• Multiple RF tags

Y. Deville, J. Damour, N. Charkani, Improved multi-tag radio-frequency identification systems based on new source separation neural networks, Proc. of ICA'99, Aussois (France), January 1999, pp. 449-454.

    Michel Verleysen Altran 18/11/2002 - 83

    Some realizations…

• Financial time series

[figure: stock-return series, their independent components (ICA), and a reconstruction from 4 ICs]

    A.D. Back, A.S. Weigend, A First Application of Independent Component Analysis to Extracting Structure from Stock Returns, International Journal of Neural Systems, Vol. 8, No.5 (October, 1997)

    Michel Verleysen Altran 18/11/2002 - 84

    Other applications…

• Proceedings of the IEEE – special issue, October 1998
  - receivers (QAM signals)
  - equalization
  - CDMA transmission
  - television receivers
  - seismic pattern identification
• ICA'99 and ICA 2000 conferences
  - machine vibration analysis
  - astronomical images
  - …

    Michel Verleysen Altran 18/11/2002 - 85

    Cocktail party

• Speech – music separation  [audio demo: observations → estimations]
• Speech – speech separation  [audio demo: observations → estimations]

    T.-W. Lee, Institute for Neural Computation, University of California (San Diego), http://www.cnl.salk.edu/~tewon/Blind/blind_audio.html

    Michel Verleysen Altran 18/11/2002 - 86

    Some realizations…

• extraction of artefacts from EEG

R. Vigario, V. Jousmäki, M. Hämäläinen, R. Hari, E. Oja, Independent component analysis for identification of artifacts in magnetoencephalographic recordings, NIPS'97, pp. 229-235, MIT Press, 1998.

    Michel Verleysen Altran 18/11/2002 - 87

    Some realizations…

• extraction of artefacts from EEG

R. Vigario, V. Jousmäki, M. Hämäläinen, R. Hari, E. Oja, Independent component analysis for identification of artifacts in magnetoencephalographic recordings, NIPS'97, pp. 229-235, MIT Press, 1998.