Information Theoretic Learning: Finding Structure in Data
Jose Principe and Sudhir Rao
University of Florida
[email protected]
Outline
Structure
Connection to Learning
Learning Structure – the old view
A new framework
Applications
Structure
• Patterns / regularities vs. amorphous / chaos
• Interdependence between subsystems vs. white noise
Connection to Learning
Types of Learning
• Supervised Learning: data + desired signal/teacher
• Reinforcement Learning: data + rewards/punishments
• Unsupervised Learning: only the data
Unsupervised Learning
What can be done only with the data??
First Principles → Examples
• Preserve maximum information: auto-associative memory, ART, PCA, Linsker’s “infomax” rule, ...
• Extract independent features: Barlow’s minimum redundancy principle, ICA, etc.
• Learn the probability distribution: Gaussian Mixture Models, EM algorithm, parametric density estimation
Connection to Self Organization
“If cell 1 is one of the cells providing input to cell 2, and if cell 1’s activity tends to be “high” whenever cell 2’s activity is “high”, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase..”
-Donald Hebb, 1949, Neuropsychologist.
[Diagram: cells A, B, C with excitatory (+) and inhibitory (−) connections to a neuron. Increase w_B proportional to the activity of B and C.]
What is the purpose????
“Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network’s functioning as an information processing system?”
- Linsker, 1988
Linsker’s Infomax principle
R = I(X, Y)
  = H(Y) − H(Y|X)
  = H(X) − H(X|Y)

[Figure: linear network with inputs X1, ..., XL, weights w1, ..., wL, additive noise, and output Y]

Maximize the Shannon rate I(X, Y)
Under Gaussian assumptions and uncorrelated noise, the rate for a linear network is

R = (1/2) ln( σ_y² / σ_n² )

Maximizing the rate therefore reduces to max_w E[y²], and gradient ascent gives

w_i(n+1) = w_i(n) + η x_i y    — the Hebbian rule!!
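The slides give no code; the following is a minimal NumPy sketch of this idea, with an explicit weight renormalization added as an assumption (the pure Hebbian update is unbounded). The covariance matrix, learning rate, and seed are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D Gaussian inputs: variance is largest along [1, 1]/sqrt(2)
C = np.array([[3.0, 2.0], [2.0, 3.0]])
X = rng.multivariate_normal([0.0, 0.0], C, size=5000)

w = rng.standard_normal(2)
eta = 0.01
for x in X:
    y = w @ x                  # linear neuron output
    w += eta * x * y           # Hebbian update: dw_i = eta * x_i * y
    w /= np.linalg.norm(w)     # renormalize so the weights stay bounded

# w aligns with the principal eigenvector of C, i.e. it maximizes E[y^2]
eigvec = np.linalg.eigh(C)[1][:, -1]
print(abs(w @ eigvec))         # close to 1
```

With the norm constraint, the update behaves like a stochastic power iteration, which is why maximizing E[y²] picks out the largest-variance direction.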
Barlow’s redundancy principle
Minimum Entropy Coding

[Figure: stimuli 1..M mapped to features 1..N; independent features → no redundancy]

Converting an M-dimensional problem into N one-dimensional problems:
• 2^M conditional probabilities P(V|stimuli) required for an event V
• only N conditional probabilities P(V|Feature i) required
ICA!!!
Summary 1
Unsupervised Learning:
• Self-organizing rule (example: Hebbian rule)
• Global objective function (example: Infomax)
• Discovering structure in data (example: PCA)
• Extracting the desired signal from the data itself
• Revealing structure through the interaction of the data points
Questions
Can we go beyond these preprocessing stages??
Can we create a global cost function which extracts “goal-oriented structures” from the data?
Can we derive a self-organizing principle from such a cost function??
A big YES!!!
What is Information Theoretic Learning?
ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.
Centerpiece is a non-parametric estimator for entropy that:
• Does not require an explicit estimation of the pdf
• Uses the Parzen window method, which is known to be consistent and efficient
• Is smooth
• Readily integrates into conventional gradient descent learning
• Provides a link to kernel learning and SVMs
• Allows an extension to random processes
ITL is a different way of thinking about data quantification
Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g., similarity, Hebb’s postulate of learning) into 2nd-order statistical equivalents.
ITL replaces 2nd order moments with a geometric statistical interpretation of data in probability spaces.
• Variance by Entropy
• Correlation by Correntropy
• Mean square error (MSE) by Minimum error entropy (MEE)
• Distances in data space by distances in probability spaces
Information Theoretic Learning: Entropy
• Not all random variables (r.v.) are equally random!

[Figure: two pdfs f_X(x) on [−5, 5], one sharply peaked and one broad]

• Entropy quantifies the degree of uncertainty in a r.v. Claude Shannon defined entropy as

H_S(X) = −Σ_x p(x) log p(x)    (discrete)
H_S(X) = −∫ f_X(x) log f_X(x) dx    (continuous)
[Figure: probability simplexes for p = (p1, p2) and p = (p1, p2, p3)]

V_α = Σ_{k=1}^n p_k^α = ‖p‖_α^α    (the α-norm of p raised to the power α)
Information Theoretic Learning: Renyi’s Entropy
H_α(X) = (1/(1−α)) log Σ_x p^α(x) = (1/(1−α)) log ∫ f_X^α(x) dx

• Norm of the pdf: V_α = ∫ f^α(y) dy, the α-norm of f raised to the power α, so H_α(X) = (1/(1−α)) log V_α

• Renyi’s entropy equals Shannon’s as α → 1
Information Theoretic Learning: Parzen Windowing
Given only samples drawn from a distribution:
{x_1, ..., x_N} ~ p(x)

p̂(x) = (1/N) Σ_{i=1}^N G_σ(x − x_i),    G_σ: kernel function

Convergence: lim_{N→∞} p̂_N(x) = p(x) * G_σ(x), provided σ(N) → 0 and Nσ(N) → ∞

[Figure: Parzen estimates of a pdf from N = 10 and N = 1000 samples, approaching the true density as N grows]
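The estimator above is a few lines of NumPy; this sketch uses a Gaussian kernel and an arbitrary kernel size σ = 0.3 (both assumptions) on standard-normal samples:

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    """Parzen estimate p_hat(x) = (1/N) sum_i G_sigma(x - x_i)."""
    u = x - samples[:, None]
    g = np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return g.mean(axis=0)

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)      # N = 1000 draws from N(0, 1)
xs = np.linspace(-5, 5, 201)
p_hat = parzen_pdf(xs, samples, sigma=0.3)

dx = xs[1] - xs[0]
print((p_hat * dx).sum())                # integrates to ~1, peaks near 0
```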
Information Theoretic Learning: Renyi’s Quadratic Entropy
Order-2 entropy & Gaussian kernels:
H_2(X) = −log ∫ p̂²(x) dx
       = −log ∫ [ (1/N) Σ_i G_σ(x − x_i) ] [ (1/N) Σ_j G_σ(x − x_j) ] dx
       = −log (1/N²) Σ_j Σ_i G_{σ√2}(x_j − x_i)

The argument of the log is the information potential, V_2(X): pairwise interactions between samples, O(N²) computation.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
p̂(x) provides a potential field over the space of the samples, parameterized by the kernel size σ.
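A sketch of the double-sum estimator for 1-D samples (kernel size and data are illustrative; the width σ√2 comes from convolving two Gaussian kernels):

```python
import numpy as np

def information_potential(x, sigma):
    """V_2(X) = (1/N^2) sum_ij G_{sigma*sqrt(2)}(x_i - x_j)."""
    d = x[:, None] - x[None, :]          # pairwise differences
    s = sigma * np.sqrt(2.0)             # Gaussian kernel widths add under convolution
    g = np.exp(-d**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)
    return g.mean()

def quadratic_entropy(x, sigma):
    """Renyi's quadratic entropy estimate H_2(X) = -log V_2(X)."""
    return -np.log(information_potential(x, sigma))

rng = np.random.default_rng(0)
narrow = rng.normal(0, 0.5, 500)
broad = rng.normal(0, 2.0, 500)
# A broader distribution is more uncertain, so its H_2 is larger
print(quadratic_entropy(narrow, 0.3) < quadratic_entropy(broad, 0.3))  # True
```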
Information Theoretic Learning: Information Force
In adaptation, samples become information particles that interact through information forces.
Information potential:

V_2(X) = (1/N²) Σ_{i=1}^N Σ_{j=1}^N G_{σ√2}(x_i − x_j)

Information force:

F(x_i) = ∂V_2/∂x_i = −(1/(N²σ²)) Σ_{j=1}^N G_{σ√2}(x_i − x_j)(x_i − x_j)
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
Erdogmus, Principe, Hild, Natural Computing, 2002.
[Figure: samples x_i, x_j in 2-D with arrows showing the information force within a dataset arising due to H(X)]
What will happen if we allow the
particles to move under the influence
of these forces?
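A minimal 1-D sketch of the forces felt by the samples when the entropy H_2 = −log V_2 is the criterion (kernel size and sample positions are assumptions). Moving along this gradient spreads the particles apart:

```python
import numpy as np

def entropy_forces(x, sigma):
    """Information forces dH_2/dx_i for 1-D samples: samples repel each other."""
    d = x[:, None] - x[None, :]                          # d[i, j] = x_i - x_j
    s2 = 2 * sigma**2                                    # (sigma*sqrt(2))^2
    g = np.exp(-d**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    dV = -(g * d).sum(axis=1) / (len(x)**2 * sigma**2)   # dV_2/dx_i
    return -dV / g.mean()                                # dH_2/dx_i = -(1/V_2) dV_2/dx_i

# Two nearby particles: maximizing entropy pushes them apart
x = np.array([0.0, 0.1])
f = entropy_forces(x, sigma=0.5)
print(f)   # f[0] < 0 < f[1]: equal and opposite, the particles repel
```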
Information Theoretic Learning Backpropagation of Information Forces
Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.
[Block diagram: Input → Adaptive System → Output; Output and Desired → IT Criterion → Information Forces; Information Forces → Adjoint Network → Weight Updates → Adaptive System]
∂J/∂w_{ij} = Σ_{n=1}^N Σ_{k=1}^P ( ∂J/∂e_k(n) ) ( ∂e_k(n)/∂w_{ij} )
Information Theoretic Learning Quadratic divergence measures
Kullback-Leibler Divergence:

D_KL(X; Y) = ∫ p(x) log( p(x)/q(x) ) dx

Renyi’s Divergence:

D_α(X; Y) = (1/(α−1)) log ∫ p(x) ( p(x)/q(x) )^{α−1} dx

Euclidean Distance:

D_E(X; Y) = ∫ ( p(x) − q(x) )² dx

Cauchy-Schwarz Distance:

D_C(X; Y) = −log( (∫ p(x) q(x) dx)² / ( ∫ p²(x) dx ∫ q²(x) dx ) )

Mutual information is a special case (divergence between the joint and the product of marginals).
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
[Block diagram: input signal X → learning system Y = q(X, W) → output signal Y; Y and the desired signal D feed the information measure I(Y, D), which drives the optimization]
Information Theoretic Learning Unifying criterion for learning from samples
Training ADALINE sample by sample: the Stochastic Information Gradient (SIG)
Theorem: The expected value of the stochastic information gradient (SIG), is the gradient of Shannon’s entropy estimated from the samples using Parzen windowing.
For the Gaussian kernel and M=1
The form is the same as for LMS except that entropy learning works with differences in samples.
The SIG works implicitly with the L1 norm of the error.
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
Ĥ_S(e) ≈ −(1/N) Σ_n log[ (1/L) Σ_{i=n−L}^{n−1} κ_σ(e_n − e_i) ]

∂Ĥ_S/∂w ≈ −(1/N) Σ_n [ Σ_{i=n−L}^{n−1} κ'_σ(e_n − e_i)(∂e_n/∂w − ∂e_i/∂w) ] / [ Σ_{i=n−L}^{n−1} κ_σ(e_n − e_i) ]

For the Gaussian kernel and M = 1 (a single past sample), the entropy-minimizing update is

Δw = −η ∂Ĥ_S/∂w = (η/σ²)(e_n − e_{n−1})(x(n) − x(n−1))
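A toy sketch of the M = 1 SIG update training a one-weight ADALINE to identify an unknown slope (the data, noise level, σ, and learning rate are assumptions; note that minimum error entropy is blind to a constant error offset, so only the slope is learned):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, eta = 2000, 1.0, 0.01

x = rng.standard_normal(N)
d = 2.0 * x + 0.1 * rng.standard_normal(N)   # unknown system: slope 2 plus noise

w = 0.0
e_prev, x_prev = d[0] - w * x[0], x[0]
for n in range(1, N):
    e = d[n] - w * x[n]
    # SIG step minimizing error entropy (Gaussian kernel, M = 1):
    # dw = (eta / sigma^2) (e_n - e_{n-1}) (x_n - x_{n-1}) -- LMS-like, on differences
    w += eta / sigma**2 * (e - e_prev) * (x[n] - x_prev)
    e_prev, x_prev = e, x[n]

print(w)   # close to the true slope 2
```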
SIG Hebbian updates
In a linear network the Hebbian update is

Δw_k = η y_k x_k

The update maximizing Shannon output entropy with the SIG becomes

Δw_k = (η/σ²)(y_k − y_{k−1})(x_k − x_{k−1})

Which is more powerful and biologically plausible?

Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
[Figure: direction found by the SIG update over 500 epochs, with Gaussian and Cauchy kernels]
Generated 50 samples of a 2-D distribution where the x axis is uniform and the y axis is Gaussian, with the sample covariance matrix equal to the identity.
Hebbian updates would converge to any direction, but SIG consistently found the 90-degree direction!
ITL - Applications
ITL applications:
• System identification
• Feature extraction
• Blind source separation
• Clustering
www.cnel.ufl.edu has ITL examples and Matlab code.
Renyi’s cross entropy

Let X = {x_i}_{i=1}^N and Y = {y_j}_{j=1}^M be two r.v.s with iid samples. Then Renyi’s cross entropy is given by

H_2(X; Y) = −log ∫ p_X(t) p_Y(t) dt = −log V(X; Y)

Using the Parzen estimate for the pdfs gives

V(X; Y) = (1/(NM)) Σ_{i=1}^N Σ_{j=1}^M G_{σ√2}(x_i − y_j)
“Cross” information potential and “cross” information force:

V(x_i; Y) = (1/M) Σ_{j=1}^M G_{σ√2}(x_i − y_j)

F(x_i; Y) = ∂V(x_i; Y)/∂x_i = −(1/(2Mσ²)) Σ_{j=1}^M G_{σ√2}(x_i − y_j)(x_i − y_j)
[Figure: cross information force between two datasets arising due to H(X; Y), drawn as forces between particles of the two datasets]
Cauchy-Schwarz Divergence

A measure of similarity between two datasets:

D_cs(X; Y) = −log( (∫ p_X(u) p_Y(u) du)² / ( ∫ p_X²(u) du ∫ p_Y²(u) du ) )
           = 2 H_2(X; Y) − H_2(X) − H_2(Y)

D_cs(X; Y) = 0 ⟺ same probability density functions
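Written in terms of information potentials, the divergence is straightforward to estimate from samples. A 1-D sketch (data and kernel size are illustrative):

```python
import numpy as np

def cross_ip(a, b, sigma):
    """Cross information potential (1/(NM)) sum_ij G_{sigma*sqrt(2)}(a_i - b_j)."""
    d = a[:, None] - b[None, :]
    s2 = 2 * sigma**2
    return np.mean(np.exp(-d**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2))

def cauchy_schwarz_divergence(x, y, sigma):
    """D_cs(X;Y) = -log( V(X;Y)^2 / (V(X) V(Y)) ): zero iff the pdf estimates match."""
    vxy, vx, vy = cross_ip(x, y, sigma), cross_ip(x, x, sigma), cross_ip(y, y, sigma)
    return -np.log(vxy**2 / (vx * vy))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 400)
b = rng.normal(0, 1, 400)    # same distribution as a
c = rng.normal(3, 1, 400)    # shifted distribution
print(cauchy_schwarz_divergence(a, b, 1.0) < cauchy_schwarz_divergence(a, c, 1.0))  # True
```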
A New ITL Framework: Information Theoretic Mean Shift
STATEMENT

Consider a dataset X_o = {x_{oi}}_{i=1}^{N_o} with iid samples. We wish to find a new dataset X = {x_i}_{i=1}^N, N = N_o, which captures the “interesting structures” of the original dataset X_o.
FORMULATION
Cost = redundancy reduction term + similarity measure term (weighted combination)
Information Theoretic Mean Shift (Form 1):

J(X) = min_X [ H(X) + λ D_cs(X; X_o) ]
This cost looks like a reaction-diffusion equation: the entropy term H(X) implements diffusion, while the Cauchy-Schwarz term D_cs(X; X_o) implements attraction of X to the original data X_o.
The weighting parameter λ squeezes the information flow through a bottleneck extracting different levels of structure in the data.
[Plot: D_cs(X; X_o) versus H(X), with λ as the slope. We can also visualize λ as a slope parameter; the previous methods used only λ = 1, i.e. D_cs(X; X_o) and H(X) in equal proportion.]
Self-organizing rule

Rewriting the cost function as

J(X) = min_X [ −(1−λ) log V(X) − 2λ log V(X; X_o) ]

Differentiating w.r.t. x_k, k = 1, 2, ..., N, and setting the gradient to zero gives

( (1−λ)/V(X) ) F(x_k; X) + ( 2λ/V(X; X_o) ) F(x_k; X_o) = 0

Rearranging yields a weighted-mean iteration:

x_k = [ c Σ_{j=1}^N G(x_k − x_j) x_j + Σ_{j=1}^{N_o} G(x_k − x_{oj}) x_{oj} ] / [ c Σ_{j=1}^N G(x_k − x_j) + Σ_{j=1}^{N_o} G(x_k − x_{oj}) ]

with c = ((1−λ)/λ) V(X; X_o)/V(X) collecting the information potentials (for N = N_o).

Fixed Point Update!!
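A compact sketch of the fixed-point iteration, with the shared kernel-normalization constants folded into the weights; the dataset, λ, σ, and iteration count are assumptions. At λ = 1 the entropy weight vanishes and the iteration reduces to Gaussian mean shift, so the samples climb to the modes:

```python
import numpy as np

def itms(Xo, lam, sigma, iters=50):
    """Fixed-point iteration for J(X) = H(X) + lam * Dcs(X; Xo) (a sketch)."""
    X = Xo.copy()
    s2 = 2 * sigma**2                       # squared width of G_{sigma*sqrt(2)}
    for _ in range(iters):
        Gxx = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / (2 * s2))
        Gxo = np.exp(-((X[:, None, :] - Xo[None, :, :])**2).sum(-1) / (2 * s2))
        a = (1 - lam) / Gxx.mean()          # (1-lam)/V(X), shared constants dropped
        b = lam / Gxo.mean()                # lam/V(X; Xo)
        X = (a * Gxx @ X + b * Gxo @ Xo) / (a * Gxx.sum(1) + b * Gxo.sum(1))[:, None]
    return X

rng = np.random.default_rng(0)
Xo = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
X = itms(Xo, lam=1.0, sigma=0.6)            # lam = 1: reduces to Gaussian mean shift
# Every sample has climbed to a mode near one of the two cluster centers
```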
An Example
[Figure: crescent-shaped dataset]
Effect of λ
Summary 2
Starting with the data:
• λ = 0 → single point
• λ = 1 → modes
• λ → ∞ → back to the data
Applications- Clustering
Segment the data into groups such that samples belonging to the same group are “closer” to each other than samples of different groups.
Statement
The idea: mode-finding ability → clustering
Mean Shift – a review
p̂_X(x) = (1/N) Σ_{i=1}^N G_σ(x − x_i)

Modes are stationary points of the equation

∇p̂_X(x) = (1/(Nσ²)) Σ_{i=1}^N G_σ(x − x_i)(x_i − x) = 0

which leads to the iteration

x^{(n+1)} = m(x^{(n)}) = Σ_i G_σ(x − x_i) x_i / Σ_i G_σ(x − x_i)
Two variants: GBMS and GMS
Gaussian Blurring Mean Shift (GBMS): a single dataset X, initialized as X = X_o:

x_k^{(n+1)} = m(x_k^{(n)}) = Σ_i G(x_k − x_i) x_i / Σ_i G(x_k − x_i)

Gaussian Mean Shift (GMS): two datasets X and X_o, initialized as X = X_o:

x_k^{(n+1)} = m(x_k^{(n)}) = Σ_i G(x_k − x_{oi}) x_{oi} / Σ_i G(x_k − x_{oi})
Connection to ITMS
J(X) = min_X [ (1−λ) H(X) + 2λ H(X; X_o) ]

λ = 0: J(X) = min_X H(X), whose fixed point is the GBMS update

x_k ← Σ_i G(x_k − x_i) x_i / Σ_i G(x_k − x_i)

λ = 1: J(X) = min_X H(X; X_o), whose fixed point is the GMS update

x_k ← Σ_i G(x_k − x_{oi}) x_{oi} / Σ_i G(x_k − x_{oi})
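The GBMS variant is worth seeing on its own, since the whole dataset blurs at every step: with a small kernel, each cluster collapses to a point. This sketch uses made-up clusters and an illustrative σ:

```python
import numpy as np

def gbms_step(X, sigma):
    """One Gaussian Blurring Mean Shift step: every point moves, the dataset blurs."""
    D = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    G = np.exp(-D / (2 * sigma**2))
    return (G @ X) / G.sum(1, keepdims=True)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.1, (30, 2)), rng.normal(1, 0.1, (30, 2))])

for _ in range(20):
    X = gbms_step(X, sigma=0.4)

# With a small kernel the two clusters collapse to (roughly) two distinct points
print(np.unique(np.round(X, 2), axis=0).shape[0])
```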
Applications- Clustering
[Figures: 10 random Gaussian clusters and their pdf plot; GBMS result; GMS result]
Image segmentation
GMS GBMS
Applications- Principal Curves
Nonlinear extension of PCA: “self-consistent” smooth curves which pass through the “middle” of a d-dimensional probability distribution or data cloud.
A new definition (Erdogmus et al.)
A point x is an element of the d-dimensional principal set, denoted P^d, iff the gradient g(x) is orthogonal to at least (n−d) eigenvectors of U(x) and p(x) is a strict local maximum in the subspace spanned by these eigenvectors.
PC continued…
P^0 is a 0-dimensional principal set corresponding to the modes of the data; P^1 is the 1-dimensional principal curve; P^2 is a 2-dimensional principal surface, and so on.
Hierarchical structure: P^{d−1} ⊂ P^d. ITMS satisfies this definition (experimentally) and gives the principal curve for intermediate values of λ.
[Figure: principal curve of spiral data passing through the modes]
Denoising
[Figure: denoising of the chain-of-rings dataset]
Applications -Vector Quantization
Limiting case of ITMS (λ → ∞).
D_cs(X; X_o) can be seen as a distortion measure between X and X_o.
Initialize X with far fewer points than X_o and solve

J(X) = min_X D_cs(X; X_o)
Comparison
[Figure: codebooks obtained by ITVQ and LBG on the same dataset]
Unsupervised Learning Tasks: choose a point in a 2D space!
• λ axis: different tasks (clustering, principal curves, vector quantization)
• σ axis: different scales
Conclusions
“Goal-oriented structures” go beyond preprocessing stages and help us extract abstract representations from the data.
A common framework binds these “interesting structures” as different levels of information extraction from the data. ITMS achieves this and can be used for clustering, principal curves, vector quantization, and more .....
What’s Next?
Correntropy: A new generalized similarity measure
Correlation is one of the most widely used functions in signal processing and pattern recognition.
But, correlation only quantifies similarity fully if the random variables are Gaussian distributed.
Can we define a new function that measures similarity but it is not restricted to second order statistics?
Use the ITL framework.
Correntropy: A new generalized similarity measure

Define the correntropy of a random process {x_t} as

V_x(s, t) = E[κ(x_s − x_t)] = ∫∫ κ(x_s − x_t) p(x_s, x_t) dx_s dx_t

We can easily estimate correntropy using kernels. For a strictly stationary and ergodic r.p., V̂_x(s, t) = Ê[κ(x_s − x_t)], estimated at lag m by

V̂(m) = (1/N) Σ_n κ(x_n − x_{n−m})

The name correntropy comes from the fact that its average over the lags (or the dimensions) is the information potential (the argument of Renyi’s quadratic entropy).
Santamaria I., Pokharel P., Principe J., “Generalized Correlation Function: Definition, Properties and Application to Blind Equalization”, IEEE Trans. Signal Proc., vol. 54, no. 6, pp. 2187–2197, 2006.
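The lag estimator is one line of NumPy. This sketch (signal, lag, and kernel size are assumptions) shows correntropy detecting the periodicity of a sinewave that white noise lacks:

```python
import numpy as np

def correntropy(x, lag, sigma):
    """Sample correntropy V_hat(m) = (1/N) sum_n G_sigma(x_n - x_{n-m})."""
    d = x[lag:] - x[:-lag] if lag > 0 else np.zeros_like(x)
    return np.mean(np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))

rng = np.random.default_rng(0)
t = np.arange(2000)
sine = np.sin(2 * np.pi * t / 20)        # period of exactly 20 samples
noise = rng.standard_normal(2000)

# The sinewave is maximally similar to itself one period later; noise is not
print(correntropy(sine, 20, 0.5) > correntropy(noise, 20, 0.5))  # True
```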
Correntropy: A new generalized similarity measure

What does it look like? [Figure: correntropy of a sinewave as a function of lag]
Correntropy: A new generalized similarity measure
Properties of correntropy:
• It has a maximum at the origin (s = t)
• It is a symmetric positive function
• Its mean value over the lags is the information potential
• Correntropy includes higher-order moments of the data: for the Gaussian kernel,

V(s, t) = (1/(√(2π) σ)) Σ_{n=0}^∞ ( (−1)^n / (2^n n!) ) E[ (x_s − x_t)^{2n} ] / σ^{2n}

• The matrix whose elements are the correntropy at different lags is Toeplitz
Correntropy: A new generalized similarity measure

• Correntropy as a cost function versus MSE:

MSE(X, Y) = E[(X − Y)²] = ∫∫ (x − y)² f_{XY}(x, y) dx dy = ∫ e² f_E(e) de

V(X, Y) = E[κ(X − Y)] = ∫∫ κ(x − y) f_{XY}(x, y) dx dy = ∫ κ(e) f_E(e) de
Correntropy: A new generalized similarity measure

• Correntropy induces a metric (CIM) in the sample space, defined by

CIM(X, Y) = ( V(0, 0) − V(X, Y) )^{1/2}

• Therefore correntropy can be used as an alternative similarity criterion in the space of samples.

[Figure: contours of the CIM metric in 2-D sample space]
Liu W., Pokharel P., Principe J., “Correntropy: Properties and Applications in Non Gaussian Signal Processing”, accepted in IEEE Trans. Sig. Proc
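A sketch of the CIM for sample vectors (the vectors and kernel size are illustrative). Unlike the Euclidean norm, the metric is bounded by √κ(0), so a single gross outlier cannot dominate the distance:

```python
import numpy as np

def kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def cim(x, y, sigma=1.0):
    """Correntropy induced metric: sqrt( k(0) - mean_n k(x_n - y_n) )."""
    return np.sqrt(kernel(0.0, sigma) - np.mean(kernel(x - y, sigma)))

x = np.zeros(5)
small = np.full(5, 0.1)
outlier = np.array([0.1, 0.1, 0.1, 0.1, 10.0])   # one gross error

# Euclidean distance explodes with the outlier; CIM saturates
print(cim(x, small), cim(x, outlier))
print(np.linalg.norm(small), np.linalg.norm(outlier))
```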
RKHS induced by Correntropy: Definition
For a stochastic process {X_t, t ∈ T}, with T an index set, the correntropy defined as

V_X(t, s) = E[κ(X_t, X_s)]

is symmetric and positive definite.
Thus it induces a new RKHS, denoted VRKHS (H_V). There is a kernel mapping Ψ such that

V_X(t, s) = ⟨Ψ(t), Ψ(s)⟩_{H_V}

Any symmetric non-negative definite kernel is the covariance kernel of a random function, and vice versa (Parzen).
Therefore, given a random function {X_t, t ∈ T}, there exists another random function {f_t, t ∈ T} such that

E[f_t f_s] = V_X(t, s)
RKHS induced by Correntropy: Definition
This RKHS seems very appropriate for nonlinear signal processing.
In this space we can compute, using linear algorithms, nonlinear systems in the input space, such as:
– Matched filters
– Wiener filters
– Principal Component Analysis
– Constrained optimization problems
– Adaptive filtering and control