Information Theoretic Learning: Finding Structure in Data
Jose Principe and Sudhir Rao
University of Florida
[email protected]
Outline
Structure
Connection to Learning
Learning Structure – the old view
A new framework
Applications
Structure
• Patterns / regularities vs. amorphous / chaos
• Interdependence between subsystems vs. white noise
Connection to Learning
Types of Learning
• Supervised Learning: data + desired signal/teacher
• Reinforcement Learning: data + rewards/punishments
• Unsupervised Learning: only the data
Unsupervised Learning
What can be done only with the data??
First Principles → Examples
• Preserve maximum information: auto-associative memory, ART, PCA, Linsker’s “infomax” rule, ...
• Extract independent features: Barlow’s minimum redundancy principle, ICA, etc.
• Learn the probability distribution: Gaussian Mixture Models, EM algorithm, parametric density estimation
Connection to Self Organization
“If cell 1 is one of the cells providing input to cell 2, and if cell 1’s activity tends to be “high” whenever cell 2’s activity is “high”, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase..”
-Donald Hebb, 1949, Neuropsychologist.
[Diagram: cells A, B, C with excitatory (+) and inhibitory (−) connections to a neuron. Increase w_B proportional to the activity of B and C.]
What is the purpose????
“Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network’s functioning as an information processing system?”
- Linsker, 1988
Linsker’s Infomax principle
R = I(X, Y)
  = H(Y) − H(Y|X)
  = H(X) − H(X|Y)

[Figure: linear network with inputs X1, ..., XL, weights w1, ..., wL, additive noise, and output Y]

Maximize the Shannon rate I(X, Y)
Under Gaussian assumptions and uncorrelated noise, the rate for a linear network is

R = (1/2) ln( σ_y² / σ_n² )

Maximizing the rate therefore reduces to max_w E[y²], and gradient ascent gives

w_i(n+1) = w_i(n) + η x_i y    — the Hebbian rule!!
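The slides give no code; the following is a minimal NumPy sketch of this idea, with an explicit weight renormalization added as an assumption (the pure Hebbian update is unbounded). The covariance matrix, learning rate, and seed are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D Gaussian inputs: variance is largest along [1, 1]/sqrt(2)
C = np.array([[3.0, 2.0], [2.0, 3.0]])
X = rng.multivariate_normal([0.0, 0.0], C, size=5000)

w = rng.standard_normal(2)
eta = 0.01
for x in X:
    y = w @ x                  # linear neuron output
    w += eta * x * y           # Hebbian update: dw_i = eta * x_i * y
    w /= np.linalg.norm(w)     # renormalize so the weights stay bounded

# w aligns with the principal eigenvector of C, i.e. it maximizes E[y^2]
eigvec = np.linalg.eigh(C)[1][:, -1]
print(abs(w @ eigvec))         # close to 1
```

With the norm constraint, the update behaves like a stochastic power iteration, which is why maximizing E[y²] picks out the largest-variance direction.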
Barlow’s redundancy principle
Minimum Entropy Coding

[Figure: stimuli 1..M mapped to features 1..N; independent features → no redundancy]

Converting an M-dimensional problem into N one-dimensional problems:
• 2^M conditional probabilities P(V|stimuli) required for an event V
• only N conditional probabilities P(V|Feature i) required
ICA!!!
Summary 1
Unsupervised Learning:
• Self-organizing rule (example: Hebbian rule)
• Global objective function (example: Infomax)
• Discovering structure in data (example: PCA)
• Extracting the desired signal from the data itself
• Revealing structure through the interaction of the data points
Questions
Can we go beyond these preprocessing stages??
Can we create a global cost function which extracts “goal-oriented structures” from the data?
Can we derive a self-organizing principle from such a cost function??
A big YES!!!
What is Information Theoretic Learning?
ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.
Centerpiece is a non-parametric estimator for entropy that:
• Does not require an explicit estimation of the pdf
• Uses the Parzen window method, which is known to be consistent and efficient
• Is smooth
• Readily integrates into conventional gradient descent learning
• Provides a link to kernel learning and SVMs
• Allows an extension to random processes
ITL is a different way of thinking about data quantification
Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g., similarity, Hebb’s postulate of learning) into 2nd-order statistical equivalents.
ITL replaces 2nd order moments with a geometric statistical interpretation of data in probability spaces.
• Variance by Entropy
• Correlation by Correntropy
• Mean square error (MSE) by Minimum error entropy (MEE)
• Distances in data space by distances in probability spaces
Information Theoretic Learning: Entropy
• Not all random variables (r.v.) are equally random!

[Figure: two pdfs f_X(x) on [−5, 5], one sharply peaked and one broad]

• Entropy quantifies the degree of uncertainty in a r.v. Claude Shannon defined entropy as

H_S(X) = −Σ_x p(x) log p(x)    (discrete)
H_S(X) = −∫ f_X(x) log f_X(x) dx    (continuous)
[Figure: probability simplexes for p = (p1, p2) and p = (p1, p2, p3)]

V_α = Σ_{k=1}^n p_k^α = ‖p‖_α^α    (the α-norm of p raised to the power α)
Information Theoretic Learning: Renyi’s Entropy
H_α(X) = (1/(1−α)) log Σ_x p^α(x) = (1/(1−α)) log ∫ f_X^α(x) dx

• Norm of the pdf: V_α = ∫ f^α(y) dy, the α-norm of f raised to the power α, so H_α(X) = (1/(1−α)) log V_α

• Renyi’s entropy equals Shannon’s as α → 1
Information Theoretic Learning: Parzen Windowing
Given only samples drawn from a distribution:
{x_1, ..., x_N} ~ p(x)

p̂(x) = (1/N) Σ_{i=1}^N G_σ(x − x_i),    G_σ: kernel function

Convergence: lim_{N→∞} p̂_N(x) = p(x) * G_σ(x), provided σ(N) → 0 and Nσ(N) → ∞

[Figure: Parzen estimates of a pdf from N = 10 and N = 1000 samples, approaching the true density as N grows]
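The estimator above is a few lines of NumPy; this sketch uses a Gaussian kernel and an arbitrary kernel size σ = 0.3 (both assumptions) on standard-normal samples:

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    """Parzen estimate p_hat(x) = (1/N) sum_i G_sigma(x - x_i)."""
    u = x - samples[:, None]
    g = np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return g.mean(axis=0)

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)      # N = 1000 draws from N(0, 1)
xs = np.linspace(-5, 5, 201)
p_hat = parzen_pdf(xs, samples, sigma=0.3)

dx = xs[1] - xs[0]
print((p_hat * dx).sum())                # integrates to ~1, peaks near 0
```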
Information Theoretic Learning: Renyi’s Quadratic Entropy
Order-2 entropy & Gaussian kernels:
H_2(X) = −log ∫ p̂²(x) dx
       = −log ∫ [ (1/N) Σ_i G_σ(x − x_i) ] [ (1/N) Σ_j G_σ(x − x_j) ] dx
       = −log (1/N²) Σ_j Σ_i G_{σ√2}(x_j − x_i)

The argument of the log is the information potential, V_2(X): pairwise interactions between samples, O(N²) computation.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
p̂(x) provides a potential field over the space of the samples, parameterized by the kernel size σ.
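A sketch of the double-sum estimator for 1-D samples (kernel size and data are illustrative; the width σ√2 comes from convolving two Gaussian kernels):

```python
import numpy as np

def information_potential(x, sigma):
    """V_2(X) = (1/N^2) sum_ij G_{sigma*sqrt(2)}(x_i - x_j)."""
    d = x[:, None] - x[None, :]          # pairwise differences
    s = sigma * np.sqrt(2.0)             # Gaussian kernel widths add under convolution
    g = np.exp(-d**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)
    return g.mean()

def quadratic_entropy(x, sigma):
    """Renyi's quadratic entropy estimate H_2(X) = -log V_2(X)."""
    return -np.log(information_potential(x, sigma))

rng = np.random.default_rng(0)
narrow = rng.normal(0, 0.5, 500)
broad = rng.normal(0, 2.0, 500)
# A broader distribution is more uncertain, so its H_2 is larger
print(quadratic_entropy(narrow, 0.3) < quadratic_entropy(broad, 0.3))  # True
```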
Information Theoretic Learning: Information Force
In adaptation, samples become information particles that interact through information forces.
Information potential:

V_2(X) = (1/N²) Σ_{i=1}^N Σ_{j=1}^N G_{σ√2}(x_i − x_j)

Information force:

F(x_i) = ∂V_2/∂x_i = −(1/(N²σ²)) Σ_{j=1}^N G_{σ√2}(x_i − x_j)(x_i − x_j)
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
Erdogmus, Principe, Hild, Natural Computing, 2002.
[Figure: samples x_i, x_j in 2-D with arrows showing the information force within a dataset arising due to H(X)]
What will happen if we allow the
particles to move under the influence
of these forces?
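A minimal 1-D sketch of the forces felt by the samples when the entropy H_2 = −log V_2 is the criterion (kernel size and sample positions are assumptions). Moving along this gradient spreads the particles apart:

```python
import numpy as np

def entropy_forces(x, sigma):
    """Information forces dH_2/dx_i for 1-D samples: samples repel each other."""
    d = x[:, None] - x[None, :]                          # d[i, j] = x_i - x_j
    s2 = 2 * sigma**2                                    # (sigma*sqrt(2))^2
    g = np.exp(-d**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    dV = -(g * d).sum(axis=1) / (len(x)**2 * sigma**2)   # dV_2/dx_i
    return -dV / g.mean()                                # dH_2/dx_i = -(1/V_2) dV_2/dx_i

# Two nearby particles: maximizing entropy pushes them apart
x = np.array([0.0, 0.1])
f = entropy_forces(x, sigma=0.5)
print(f)   # f[0] < 0 < f[1]: equal and opposite, the particles repel
```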
Information Theoretic Learning Backpropagation of Information Forces
Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.
[Block diagram: Input → Adaptive System → Output; Output and Desired → IT Criterion → Information Forces; Information Forces → Adjoint Network → Weight Updates → Adaptive System]
∂J/∂w_{ij} = Σ_{n=1}^N Σ_{k=1}^P ( ∂J/∂e_k(n) ) ( ∂e_k(n)/∂w_{ij} )
Information Theoretic Learning Quadratic divergence measures
Kullback-Leibler Divergence:

D_KL(X; Y) = ∫ p(x) log( p(x)/q(x) ) dx

Renyi’s Divergence:

D_α(X; Y) = (1/(α−1)) log ∫ p(x) ( p(x)/q(x) )^{α−1} dx

Euclidean Distance:

D_E(X; Y) = ∫ ( p(x) − q(x) )² dx

Cauchy-Schwarz Distance:

D_C(X; Y) = −log( (∫ p(x) q(x) dx)² / ( ∫ p²(x) dx ∫ q²(x) dx ) )

Mutual information is a special case (divergence between the joint and the product of marginals).
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
[Block diagram: input signal X → learning system Y = q(X, W) → output signal Y; Y and the desired signal D feed the information measure I(Y, D), which drives the optimization]
Information Theoretic Learning Unifying criterion for learning from samples
Training ADALINE sample by sample: the Stochastic Information Gradient (SIG)
Theorem: The expected value of the stochastic information gradient (SIG), is the gradient of Shannon’s entropy estimated from the samples using Parzen windowing.
For the Gaussian kernel and M=1
The form is the same as for LMS except that entropy learning works with differences in samples.
The SIG works implicitly with the L1 norm of the error.
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
Ĥ_S(e) ≈ −(1/N) Σ_n log[ (1/L) Σ_{i=n−L}^{n−1} κ_σ(e_n − e_i) ]

∂Ĥ_S/∂w ≈ −(1/N) Σ_n [ Σ_{i=n−L}^{n−1} κ'_σ(e_n − e_i)(∂e_n/∂w − ∂e_i/∂w) ] / [ Σ_{i=n−L}^{n−1} κ_σ(e_n − e_i) ]

For the Gaussian kernel and M = 1 (a single past sample), the entropy-minimizing update is

Δw = −η ∂Ĥ_S/∂w = (η/σ²)(e_n − e_{n−1})(x(n) − x(n−1))
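A toy sketch of the M = 1 SIG update training a one-weight ADALINE to identify an unknown slope (the data, noise level, σ, and learning rate are assumptions; note that minimum error entropy is blind to a constant error offset, so only the slope is learned):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, eta = 2000, 1.0, 0.01

x = rng.standard_normal(N)
d = 2.0 * x + 0.1 * rng.standard_normal(N)   # unknown system: slope 2 plus noise

w = 0.0
e_prev, x_prev = d[0] - w * x[0], x[0]
for n in range(1, N):
    e = d[n] - w * x[n]
    # SIG step minimizing error entropy (Gaussian kernel, M = 1):
    # dw = (eta / sigma^2) (e_n - e_{n-1}) (x_n - x_{n-1}) -- LMS-like, on differences
    w += eta / sigma**2 * (e - e_prev) * (x[n] - x_prev)
    e_prev, x_prev = e, x[n]

print(w)   # close to the true slope 2
```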
SIG Hebbian updates
In a linear network the Hebbian update is

Δw_k = η y_k x_k

The update maximizing Shannon output entropy with the SIG becomes

Δw_k = (η/σ²)(y_k − y_{k−1})(x_k − x_{k−1})

Which is more powerful and biologically plausible?

Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
[Figure: direction found by the SIG update over 500 epochs, with Gaussian and Cauchy kernels]
Generated 50 samples of a 2-D distribution where the x axis is uniform and the y axis is Gaussian, with the sample covariance matrix equal to the identity.
Hebbian updates would converge to any direction, but SIG consistently found the 90-degree direction!
ITL - Applications
ITL applications:
• System identification
• Feature extraction
• Blind source separation
• Clustering
www.cnel.ufl.edu has ITL examples and Matlab code.
Renyi’s cross entropy

Let X = {x_i}_{i=1}^N and Y = {y_j}_{j=1}^M be two r.v.s with iid samples. Then Renyi’s cross entropy is given by

H_2(X; Y) = −log ∫ p_X(t) p_Y(t) dt = −log V(X; Y)

Using the Parzen estimate for the pdfs gives

V(X; Y) = (1/(NM)) Σ_{i=1}^N Σ_{j=1}^M G_{σ√2}(x_i − y_j)
“Cross” information potential and “cross” information force:

V(x_i; Y) = (1/M) Σ_{j=1}^M G_{σ√2}(x_i − y_j)

F(x_i; Y) = ∂V(x_i; Y)/∂x_i = −(1/(2Mσ²)) Σ_{j=1}^M G_{σ√2}(x_i − y_j)(x_i − y_j)
[Figure: cross information force between two datasets arising due to H(X; Y), drawn as forces between particles of the two datasets]
Cauchy-Schwarz Divergence

A measure of similarity between two datasets:

D_cs(X; Y) = −log( (∫ p_X(u) p_Y(u) du)² / ( ∫ p_X²(u) du ∫ p_Y²(u) du ) )
           = 2 H_2(X; Y) − H_2(X) − H_2(Y)

D_cs(X; Y) = 0 ⟺ same probability density functions
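Written in terms of information potentials, the divergence is straightforward to estimate from samples. A 1-D sketch (data and kernel size are illustrative):

```python
import numpy as np

def cross_ip(a, b, sigma):
    """Cross information potential (1/(NM)) sum_ij G_{sigma*sqrt(2)}(a_i - b_j)."""
    d = a[:, None] - b[None, :]
    s2 = 2 * sigma**2
    return np.mean(np.exp(-d**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2))

def cauchy_schwarz_divergence(x, y, sigma):
    """D_cs(X;Y) = -log( V(X;Y)^2 / (V(X) V(Y)) ): zero iff the pdf estimates match."""
    vxy, vx, vy = cross_ip(x, y, sigma), cross_ip(x, x, sigma), cross_ip(y, y, sigma)
    return -np.log(vxy**2 / (vx * vy))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 400)
b = rng.normal(0, 1, 400)    # same distribution as a
c = rng.normal(3, 1, 400)    # shifted distribution
print(cauchy_schwarz_divergence(a, b, 1.0) < cauchy_schwarz_divergence(a, c, 1.0))  # True
```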
A New ITL Framework: Information Theoretic Mean Shift
STATEMENT

Consider a dataset X_o = {x_{oi}}_{i=1}^{N_o} with iid samples. We wish to find a new dataset X = {x_i}_{i=1}^N, N = N_o, which captures the “interesting structures” of the original dataset X_o.
FORMULATION
Cost = redundancy reduction term + similarity measure term (weighted combination)
Information Theoretic Mean Shift (Form 1):

J(X) = min_X [ H(X) + λ D_cs(X; X_o) ]
This cost looks like a reaction-diffusion equation: the entropy term H(X) implements diffusion, while the Cauchy-Schwarz term D_cs(X; X_o) implements attraction of X to the original data X_o.
The weighting parameter λ squeezes the information flow through a bottleneck extracting different levels of structure in the data.
[Plot: D_cs(X; X_o) versus H(X), with λ as the slope. We can also visualize λ as a slope parameter; the previous methods used only λ = 1, i.e. D_cs(X; X_o) and H(X) in equal proportion.]
Self-organizing rule

Rewriting the cost function as

J(X) = min_X [ −(1−λ) log V(X) − 2λ log V(X; X_o) ]

Differentiating w.r.t. x_k, k = 1, 2, ..., N, and setting the gradient to zero gives

( (1−λ)/V(X) ) F(x_k; X) + ( 2λ/V(X; X_o) ) F(x_k; X_o) = 0

Rearranging yields a weighted-mean iteration:

x_k = [ c Σ_{j=1}^N G(x_k − x_j) x_j + Σ_{j=1}^{N_o} G(x_k − x_{oj}) x_{oj} ] / [ c Σ_{j=1}^N G(x_k − x_j) + Σ_{j=1}^{N_o} G(x_k − x_{oj}) ]

with c = ((1−λ)/λ) V(X; X_o)/V(X) collecting the information potentials (for N = N_o).

Fixed Point Update!!
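A compact sketch of the fixed-point iteration, with the shared kernel-normalization constants folded into the weights; the dataset, λ, σ, and iteration count are assumptions. At λ = 1 the entropy weight vanishes and the iteration reduces to Gaussian mean shift, so the samples climb to the modes:

```python
import numpy as np

def itms(Xo, lam, sigma, iters=50):
    """Fixed-point iteration for J(X) = H(X) + lam * Dcs(X; Xo) (a sketch)."""
    X = Xo.copy()
    s2 = 2 * sigma**2                       # squared width of G_{sigma*sqrt(2)}
    for _ in range(iters):
        Gxx = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / (2 * s2))
        Gxo = np.exp(-((X[:, None, :] - Xo[None, :, :])**2).sum(-1) / (2 * s2))
        a = (1 - lam) / Gxx.mean()          # (1-lam)/V(X), shared constants dropped
        b = lam / Gxo.mean()                # lam/V(X; Xo)
        X = (a * Gxx @ X + b * Gxo @ Xo) / (a * Gxx.sum(1) + b * Gxo.sum(1))[:, None]
    return X

rng = np.random.default_rng(0)
Xo = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
X = itms(Xo, lam=1.0, sigma=0.6)            # lam = 1: reduces to Gaussian mean shift
# Every sample has climbed to a mode near one of the two cluster centers
```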
An Example
[Figure: crescent-shaped dataset]
Effect of λ
Summary 2
Starting with the data:
• λ = 0 → single point
• λ = 1 → modes
• λ → ∞ → back to the data
Applications- Clustering
Segment the data into groups such that samples belonging to the same group are “closer” to each other than samples of different groups.
Statement
The idea: mode-finding ability → clustering
Mean Shift – a review
p̂_X(x) = (1/N) Σ_{i=1}^N G_σ(x − x_i)

Modes are stationary points of the equation

∇p̂_X(x) = (1/(Nσ²)) Σ_{i=1}^N G_σ(x − x_i)(x_i − x) = 0

which leads to the iteration

x^{(n+1)} = m(x^{(n)}) = Σ_i G_σ(x − x_i) x_i / Σ_i G_σ(x − x_i)
Two variants: GBMS and GMS
Gaussian Blurring Mean Shift (GBMS): a single dataset X, initialized as X = X_o:

x_k^{(n+1)} = m(x_k^{(n)}) = Σ_i G(x_k − x_i) x_i / Σ_i G(x_k − x_i)

Gaussian Mean Shift (GMS): two datasets X and X_o, initialized as X = X_o:

x_k^{(n+1)} = m(x_k^{(n)}) = Σ_i G(x_k − x_{oi}) x_{oi} / Σ_i G(x_k − x_{oi})
Connection to ITMS
J(X) = min_X [ (1−λ) H(X) + 2λ H(X; X_o) ]

λ = 0: J(X) = min_X H(X), whose fixed point is the GBMS update

x_k ← Σ_i G(x_k − x_i) x_i / Σ_i G(x_k − x_i)

λ = 1: J(X) = min_X H(X; X_o), whose fixed point is the GMS update

x_k ← Σ_i G(x_k − x_{oi}) x_{oi} / Σ_i G(x_k − x_{oi})
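The GBMS variant is worth seeing on its own, since the whole dataset blurs at every step: with a small kernel, each cluster collapses to a point. This sketch uses made-up clusters and an illustrative σ:

```python
import numpy as np

def gbms_step(X, sigma):
    """One Gaussian Blurring Mean Shift step: every point moves, the dataset blurs."""
    D = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    G = np.exp(-D / (2 * sigma**2))
    return (G @ X) / G.sum(1, keepdims=True)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.1, (30, 2)), rng.normal(1, 0.1, (30, 2))])

for _ in range(20):
    X = gbms_step(X, sigma=0.4)

# With a small kernel the two clusters collapse to (roughly) two distinct points
print(np.unique(np.round(X, 2), axis=0).shape[0])
```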
Applications- Clustering
[Figures: 10 random Gaussian clusters and their pdf plot; GBMS result; GMS result]
Image segmentation
GMS GBMS
Applications- Principal Curves
Nonlinear extension of PCA: “self-consistent” smooth curves which pass through the “middle” of a d-dimensional probability distribution or data cloud.
A new definition (Erdogmus et al.)
A point x is an element of the d-dimensional principal set, denoted P^d, iff the gradient g(x) is orthogonal to at least (n−d) eigenvectors of U(x) and p(x) is a strict local maximum in the subspace spanned by these eigenvectors.
PC continued…
P^0 is a 0-dimensional principal set corresponding to the modes of the data; P^1 is the 1-dimensional principal curve; P^2 is a 2-dimensional principal surface, and so on.
Hierarchical structure: P^{d−1} ⊂ P^d. ITMS satisfies this definition (experimentally) and gives the principal curve for intermediate values of λ.
[Figure: principal curve of spiral data passing through the modes]
Denoising
[Figure: denoising of the chain-of-rings dataset]
Applications -Vector Quantization
Limiting case of ITMS (λ → ∞).
D_cs(X; X_o) can be seen as a distortion measure between X and X_o.
Initialize X with far fewer points than X_o and solve

J(X) = min_X D_cs(X; X_o)
Comparison
[Figure: codebooks obtained by ITVQ and LBG on the same dataset]
Unsupervised Learning Tasks: choose a point in a 2D space!
• λ axis: different tasks (clustering, principal curves, vector quantization)
• σ axis: different scales
Conclusions
“Goal-oriented structures” go beyond preprocessing stages and help us extract abstract representations from the data.
A common framework binds these “interesting structures” as different levels of information extraction from the data. ITMS achieves this and can be used for clustering, principal curves, vector quantization, and more .....
What’s Next?
Correntropy: A new generalized similarity measure
Correlation is one of the most widely used functions in signal processing and pattern recognition.
But, correlation only quantifies similarity fully if the random variables are Gaussian distributed.
Can we define a new function that measures similarity but it is not restricted to second order statistics?
Use the ITL framework.
Correntropy: A new generalized similarity measure

Define the correntropy of a random process {x_t} as

V_x(s, t) = E[κ(x_s − x_t)] = ∫∫ κ(x_s − x_t) p(x_s, x_t) dx_s dx_t

We can easily estimate correntropy using kernels. For a strictly stationary and ergodic r.p., V̂_x(s, t) = Ê[κ(x_s − x_t)], estimated at lag m by

V̂(m) = (1/N) Σ_n κ(x_n − x_{n−m})

The name correntropy comes from the fact that its average over the lags (or the dimensions) is the information potential (the argument of Renyi’s quadratic entropy).
Santamaria I., Pokharel P., Principe J., “Generalized Correlation Function: Definition, Properties and Application to Blind Equalization”, IEEE Trans. Signal Proc., vol. 54, no. 6, pp. 2187–2197, 2006.
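The lag estimator is one line of NumPy. This sketch (signal, lag, and kernel size are assumptions) shows correntropy detecting the periodicity of a sinewave that white noise lacks:

```python
import numpy as np

def correntropy(x, lag, sigma):
    """Sample correntropy V_hat(m) = (1/N) sum_n G_sigma(x_n - x_{n-m})."""
    d = x[lag:] - x[:-lag] if lag > 0 else np.zeros_like(x)
    return np.mean(np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))

rng = np.random.default_rng(0)
t = np.arange(2000)
sine = np.sin(2 * np.pi * t / 20)        # period of exactly 20 samples
noise = rng.standard_normal(2000)

# The sinewave is maximally similar to itself one period later; noise is not
print(correntropy(sine, 20, 0.5) > correntropy(noise, 20, 0.5))  # True
```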
Correntropy: A new generalized similarity measure

What does it look like? [Figure: correntropy of a sinewave as a function of lag]
Correntropy: A new generalized similarity measure
Properties of correntropy:
• It has a maximum at the origin (s = t)
• It is a symmetric positive function
• Its mean value over the lags is the information potential
• Correntropy includes higher-order moments of the data: for the Gaussian kernel,

V(s, t) = (1/(√(2π) σ)) Σ_{n=0}^∞ ( (−1)^n / (2^n n!) ) E[ (x_s − x_t)^{2n} ] / σ^{2n}

• The matrix whose elements are the correntropy at different lags is Toeplitz
Correntropy: A new generalized similarity measure

• Correntropy as a cost function versus MSE:

MSE(X, Y) = E[(X − Y)²] = ∫∫ (x − y)² f_{XY}(x, y) dx dy = ∫ e² f_E(e) de

V(X, Y) = E[κ(X − Y)] = ∫∫ κ(x − y) f_{XY}(x, y) dx dy = ∫ κ(e) f_E(e) de
Correntropy: A new generalized similarity measure

• Correntropy induces a metric (CIM) in the sample space, defined by

CIM(X, Y) = ( V(0, 0) − V(X, Y) )^{1/2}

• Therefore correntropy can be used as an alternative similarity criterion in the space of samples.

[Figure: contours of the CIM metric in 2-D sample space]
Liu W., Pokharel P., Principe J., “Correntropy: Properties and Applications in Non Gaussian Signal Processing”, accepted in IEEE Trans. Sig. Proc
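A sketch of the CIM for sample vectors (the vectors and kernel size are illustrative). Unlike the Euclidean norm, the metric is bounded by √κ(0), so a single gross outlier cannot dominate the distance:

```python
import numpy as np

def kernel(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def cim(x, y, sigma=1.0):
    """Correntropy induced metric: sqrt( k(0) - mean_n k(x_n - y_n) )."""
    return np.sqrt(kernel(0.0, sigma) - np.mean(kernel(x - y, sigma)))

x = np.zeros(5)
small = np.full(5, 0.1)
outlier = np.array([0.1, 0.1, 0.1, 0.1, 10.0])   # one gross error

# Euclidean distance explodes with the outlier; CIM saturates
print(cim(x, small), cim(x, outlier))
print(np.linalg.norm(small), np.linalg.norm(outlier))
```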
RKHS induced by Correntropy: Definition
For a stochastic process {X_t, t ∈ T}, with T an index set, the correntropy defined as

V_X(t, s) = E[κ(X_t, X_s)]

is symmetric and positive definite.
Thus it induces a new RKHS, denoted VRKHS (H_V). There is a kernel mapping Ψ such that

V_X(t, s) = ⟨Ψ(t), Ψ(s)⟩_{H_V}

Any symmetric non-negative definite kernel is the covariance kernel of a random function, and vice versa (Parzen).
Therefore, given a random function {X_t, t ∈ T}, there exists another random function {f_t, t ∈ T} such that

E[f_t f_s] = V_X(t, s)
RKHS induced by Correntropy: Definition
This RKHS seems very appropriate for nonlinear signal processing.
In this space we can compute, using linear algorithms, nonlinear systems in the input space, such as:
– Matched filters
– Wiener filters
– Principal Component Analysis
– Constrained optimization problems
– Adaptive filtering and control