Renyi's entropy

History: Alfred Renyi was looking for the most general definition of information measures that would preserve the additivity for independent events and was compatible with the axioms of probability.

He started with Cauchy's functional equation: if p and q are independent, then I(pq) = I(p) + I(q).

Apart from a normalizing constant this is compatible with Hartley's information content I(p) = -log p. If we assume that the events X = {x1, ..., xN} have different probabilities {p1, ..., pN}, and each delivers Ik bits of information, then the total amount of information for the set is

$$I(P) = \sum_{k=1}^{N} p_k I_k$$

This can be recognized as Shannon's entropy. But he reasoned that there is an implicit assumption used in this equation: we use the linear average, which is not the only one that can be used. In the general theory of means, for any function g(x) with inverse $g^{-1}$, the mean can be computed as

$$\bar{x} = g^{-1}\left(\sum_{k=1}^{N} p_k\, g(x_k)\right)$$
Applying this definition to the I(P) we get

$$I(P) = g^{-1}\left(\sum_{k=1}^{N} p_k\, g(I_k)\right)$$

When the postulate of additivity for independent events is applied, we get just two possible g(x):

$$g(x) = cx \qquad\qquad g(x) = c\,2^{(1-\alpha)x}$$

The first form gives Shannon information and the second gives

$$I(P) = \frac{1}{1-\alpha}\log\left(\sum_{k=1}^{N} p_k^\alpha\right)$$

for non-negative α different from 1. This gives a parametric family of information measures that are called today Renyi's entropies.

It can be shown that Shannon is a special case when α → 1:

$$\lim_{\alpha\to1} H_\alpha(X) = \lim_{\alpha\to1}\frac{1}{1-\alpha}\log\sum_{k=1}^{N} p_k^\alpha = -\sum_{k=1}^{N} p_k\log p_k = H_S(X)$$

So Renyi's entropies contain Shannon as a special case.
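As a quick numerical illustration (ours, not from the original slides), the following Python sketch evaluates the discrete Renyi entropy and shows the α → 1 values approaching Shannon's entropy; the helper names renyi_entropy and shannon_entropy are our own.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """H_alpha(P) = log(sum_k p_k^alpha) / (1 - alpha), in bits."""
    p = np.asarray(p, dtype=float)
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

p = [0.5, 0.25, 0.25]
for alpha in (0.5, 0.99, 1.01, 2.0):
    print(f"H_{alpha} = {renyi_entropy(p, alpha):.4f}")
print(f"Shannon  = {shannon_entropy(p):.4f}")  # the alpha -> 1 values approach this
```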
Meaning of α

If we compare the two definitions, we see that instead of weighting the log pk by the probabilities, here the log is external to the sum and its argument is the α power of the PMF, i.e.

$$V_\alpha(X) = \sum_k p_k^\alpha$$

Note that Vα(X) is the argument of the α norm of the PMF:

$$H_\alpha(X) = \frac{1}{1-\alpha}\log\left(V_\alpha(X)\right) = -\log\left(V_\alpha(X)^{1/(\alpha-1)}\right) = -\log\left(\left(E\left[p(X)^{\alpha-1}\right]\right)^{1/(\alpha-1)}\right)$$

$$V_\alpha(X) = \sum_{k=1}^{n} p_k^\alpha = \lVert p\rVert_\alpha^\alpha \qquad (\text{the } \alpha\text{-norm of } p \text{ raised to the power } \alpha)$$

A geometric view will help here. All the PMFs of N r.v. exist on what is called the simplex, in an N dimensional space.

[Figure: the simplex for N = 2, the segment between (1, 0) and (0, 1) in the (p1, p2) plane with p = (p1, p2), and the corresponding triangle for p = (p1, p2, p3).]
The norm measures exactly the distance of the PMF to the origin, and α is the designated l-norm (the Euclidean norm is l = 2). Changing α changes the metric distance in the simplex.

From another perspective, the α root of Vα(X) is the α-norm of the PMF. So we conclude that Renyi's entropy is more flexible than Shannon and includes Shannon as a special case.

We will be using extensively α = 2.

[Figure: entropy contours over the 3-symbol simplex with vertices [1 0 0], [0 1 0], [0 0 1], shown for α = 0.2, 0.5, 1, 2, 4 and 20.]
Properties of Renyi’s entropies
Renyi's quadratic entropy

We will be using heavily Renyi's entropy with α = 2, called the quadratic entropy:

$$H_2(X) = -\log\left(\sum_k p_k^2\right)$$

It has been used in physics, in signal processing and in economics.

We want also to stress that the argument of the log, which is the 2-norm of the PMF, V2(X), has meaning in itself. In fact it is E[p(X)], or, if we make the change of variables ξk = p(xk), it is the mean of the transformed variable when the PMF is the transformation.

For us Renyi's quadratic entropy is appealing because we have found an easy way to estimate it directly from samples.

[Figure: a PMF p(x) versus x, marking the mean of X on the abscissa and the mean of p(X) on the ordinate.]
Extensions to continuous variables.

It is possible to show that Renyi's entropy measure extends to continuous r.v. and reads

$$H_\alpha(X) = \lim_{n\to\infty}\left(I_\alpha(P_n) - \log n\right) = \frac{1}{1-\alpha}\log\int p(x)^\alpha\,dx$$

and for the quadratic entropy

$$H_2(X) = -\log\int p(x)^2\,dx$$

Notice however that now the entropy is no longer positive; in fact it can take arbitrarily large negative values.
Estimation of Renyi's quadratic entropy

We will use here ideas from density estimation called kernel density estimation. This is an old method (Rosenblatt) that is now called Parzen density estimation, because Parzen proved many important statistical properties of the estimator.

The idea is very simple: place a kernel over the samples and sum with proper normalization, i.e.

$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x - x_i)$$

The kernel κ has the following properties:

1. $\sup_{\mathbb{R}}\kappa < \infty$
2. $\int_{\mathbb{R}}\kappa < \infty$
3. $\lim_{x\to\infty}\lvert x\,\kappa(x)\rvert = 0$
4. $\kappa(x)\ge 0,\quad \int_{\mathbb{R}}\kappa(x)\,dx = 1$

Parzen proved that the estimator is asymptotically unbiased and consistent with a good efficiency. The free parameter σ is called the kernel size. Normally a Gaussian is used and σ becomes the standard deviation.
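A minimal Python sketch of the Parzen estimator above, assuming a Gaussian kernel; the function name parzen_estimate is our own:

```python
import numpy as np

def parzen_estimate(x, samples, sigma):
    """p_hat(x) = (1/N) sum_i G_sigma(x - x_i), Gaussian kernel of size sigma."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # shape (M, 1)
    s = np.asarray(samples, dtype=float)[None, :]            # shape (1, N)
    g = np.exp(-(x - s) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return g.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 500)
grid = np.linspace(-3.0, 3.0, 7)
print(parzen_estimate(grid, samples, sigma=0.3))  # roughly the N(0,1) density
```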
But notice that here we are not interested in estimating the PDF, which is a function; we are interested in a single number, the 2-norm of the PDF. For a Gaussian kernel, substituting the estimator yields immediately

$$\begin{aligned}
\hat H_2(X) &= -\log\int_{-\infty}^{\infty}\left(\frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)\right)^2 dx \\
&= -\log\left(\frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N}\int_{-\infty}^{\infty} G_\sigma(x - x_j)\,G_\sigma(x - x_i)\,dx\right) \\
&= -\log\left(\frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)\right)
\end{aligned}$$

We call this estimator for V2(X) the Information Potential. Let us look at the derivation: we never had to compute the integral explicitly, since the integral of the product of Gaussians is the value of a Gaussian at the difference of arguments (with a larger kernel size). Let us also look at the expression: the algorithm is O(N2), and there is a free parameter σ that the user has to select from the data. Cross validation or Silverman's rule suffice for most cases:

$$\sigma_{opt} = \sigma_X\left(4N^{-1}(2d+1)^{-1}\right)^{1/(d+4)}$$
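Here is a minimal Python sketch (ours) of the O(N2) Information Potential estimator together with Silverman's rule for the kernel size; the names are illustrative only:

```python
import numpy as np

def silverman_sigma(x, d=1):
    """Silverman's rule: sigma_X * (4 N^-1 (2d+1)^-1)^(1/(d+4))."""
    x = np.asarray(x, dtype=float)
    return x.std() * (4.0 / ((2 * d + 1) * len(x))) ** (1.0 / (d + 4))

def information_potential(x, sigma):
    """V2_hat = (1/N^2) sum_ij G_{sigma*sqrt2}(x_j - x_i), via pairwise sums."""
    x = np.asarray(x, dtype=float)
    d2 = (x[:, None] - x[None, :]) ** 2
    s2 = 2.0 * sigma ** 2                  # variance of the sqrt(2)-wider kernel
    g = np.exp(-d2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return g.sum() / len(x) ** 2

rng = np.random.default_rng(1)
x = rng.normal(size=200)
sigma = silverman_sigma(x)
print(sigma, -np.log(information_potential(x, sigma)))   # H2_hat(X)
```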
y’s entropy of any
al mean
imators.
])(X
1
SE C. PRINCIPE
OF FLORIDA
L 6935
2-392-2662
pyright 2009
page 10 of 35
Extended Estimator for Renyi’s Entropy
We can come up with still another estimator for Renorder
by approximating the expected value by the empiric
to yield
We will now present a set of properties of these est
[log1
1)(log
11
)( 1pEdxxpXH XXX−
∞
∞−
Δ
−=
−= ∫ αα
α αα
∑=
−
−≈
N
jjX xp
NXH
1
1 )(1log1
1)( αα α
∑ ∑
∑ ∑
=
−
=
=
−
=
⎟⎟⎠
⎞⎜⎜⎝
⎛−
−=
⎟⎟⎠
⎞⎜⎜⎝
⎛−
−=
N
j
N
iij
N
j
N
iij
xxN
xxNN
XH
1
1
1
1 1
)(1log1
1
)(11
log1
1)(ˆ
α
σα
α
σα
κα
κα
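A sketch of the same idea in Python for arbitrary order α, again assuming a Gaussian kernel; renyi_entropy_hat is a hypothetical helper name:

```python
import numpy as np

def renyi_entropy_hat(x, sigma, alpha):
    """H_alpha_hat = 1/(1-alpha) log (1/N) sum_j p_hat(x_j)^(alpha-1),
    with p_hat the Gaussian Parzen estimate evaluated at the samples."""
    x = np.asarray(x, dtype=float)
    d2 = (x[:, None] - x[None, :]) ** 2
    g = np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    p_hat = g.mean(axis=1)                  # p_hat(x_j) for each sample
    return np.log(np.mean(p_hat ** (alpha - 1))) / (1.0 - alpha)

rng = np.random.default_rng(2)
x = rng.normal(size=200)
for alpha in (1.01, 2.0, 3.0):
    print(alpha, renyi_entropy_hat(x, sigma=0.4, alpha=alpha))
```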
Properties of the Estimators

Property 2.1: The estimator of Eq. (2.18) for Renyi's quadratic entropy using Gaussian kernels only differs from the IP of Eq. (2.14) by a factor of √2 in the kernel size.

Property 2.2: For any Parzen kernel that obeys the relation

$$\kappa_{new}(x_i - x_j) = \int_{-\infty}^{\infty}\kappa_{old}(x - x_i)\,\kappa_{old}(x - x_j)\,dx \qquad (2.19)$$

the estimator of Renyi's quadratic entropy of Eq. (2.18) matches the estimator of Eq. (2.14) using the IP.

Property 2.3. The kernel size must be a parameter that satisfies the scaling property $\kappa_{c\sigma}(x) = \kappa_\sigma(x/c)/c$ for any positive factor c.

Property 2.4. The entropy estimator in Eq. (2.18) is invariant to the mean of the underlying density of the samples, as is the actual entropy.

Property 2.5. The limit of Renyi's entropy as α → 1 is Shannon's entropy. The limit of the entropy estimator in Eq. (2.18) as α → 1 is Shannon's entropy estimated using Parzen windowing with the expectation approximated by the sample mean.
Properties of the Estimators

Property 2.6. In order to maintain consistency with the scaling property of the actual entropy, if the entropy of samples {x1,…,xN} of a random variable X is estimated using a kernel size of σ, the entropy of the samples {cx1,…,cxN} of the random variable cX must be estimated using a kernel size of |c|σ.

Property 2.7. When estimating the joint entropy of an n-dimensional random vector X from its samples {x1,…,xN}, use a multi-dimensional kernel that is the product of single-dimensional kernels. This way, the estimate of the joint entropy and the estimates of the marginal entropies are consistent.

Theorem 2.1. The entropy estimator in Eq. (2.18) is consistent if the Parzen windowing and the sample mean are consistent for the actual PDF of the iid samples.

Theorem 2.2. If the maximum value of the kernel κσ(ξ) is achieved when ξ = 0, then the minimum value of the entropy estimator in Eq. (2.18) is achieved when all samples are equal to each other, i.e., x1 = … = xN = c.
Properties of the Estimators

Theorem 2.3. If the kernel function κσ(.) is continuous, differentiable, symmetric and unimodal, then the global minimum described in Theorem 2.2 of the entropy estimator in Eq. (2.18) is smooth, i.e., it has a zero gradient and a positive semi-definite Hessian matrix.

Property 2.8. If the kernel function satisfies the conditions in Theorem 2.3, then in the limit, as the kernel size tends to infinity, the quadratic entropy estimator approaches the logarithm of a scaled and biased version of the sample variance.

Property 2.9. In the case of joint entropy estimation, if the multi-dimensional kernel function satisfies $\kappa_\Sigma(\xi) = \kappa_\Sigma(R^{-1}\xi)$ for all orthonormal matrices R, then the entropy estimator in Eq. (2.18) is invariant under rotations, as is the actual entropy of a random vector X. Notice that the condition on the joint kernel function requires hyper-spherical symmetry.
Properties of the Estimators

Theorem 2.4. $\lim_{N\to\infty}\hat H_\alpha(X) = H_\alpha(\hat X) \ge H_\alpha(X)$, where $\hat X$ is a random variable with the PDF $f_X(\cdot)*\kappa_\sigma(\cdot)$. The equality (in the inequality portion) occurs if and only if (iff) the kernel size is zero. This result is also valid on the average for the finite-sample case.

Bias of the Information Potential:

$$Bias\left[\hat V(X)\right] = E\left[\hat V(X)\right] - V(X) = \frac{\sigma^2}{2}E\left[f''(X)\right] = \frac{\sigma^2}{2}\int f''(x)f(x)\,dx, \qquad \sigma\to0,\ N\to\infty$$

Variance of the Information Potential:

$$Var\left[\hat V(X)\right] = E\left[\hat V^2(X)\right] - \left(E\left[\hat V(X)\right]\right)^2 = \frac{N(N-1)(N-2)}{N^4}\,a + \frac{N(N-1)}{N^4}\,b \approx \frac{a}{N}$$

$$a = E\left[\kappa_\sigma(x_i - x_j)\,\kappa_\sigma(x_j - x_l)\right] - E\left[\kappa_\sigma(x_i - x_j)\right]E\left[\kappa_\sigma(x_j - x_l)\right] \qquad b = Var\left[\kappa_\sigma(x_i - x_j)\right]$$

The AMISE (asymptotic mean integrated square error):

$$AMISE\left[\hat V(X)\right] = E\left[\left(\hat V(X) - V(X)\right)^2\right] = \frac{\sigma^4}{4}\left(\int f''(x)f(x)\,dx\right)^2 + \frac{a(N-1)}{N^2}$$

This is expected from kernel density estimation theory; the IP belongs to this class of estimators.
Physical Interpretation of the Information Potential Estimator

There is a useful physical interpretation of the IP. Think of the gravitational or the electrostatic field. Assume the samples are particles (information particles) of unit mass that interact with each other with the rules specified by the Parzen kernel used. This analogy is exact! The information potential field is given by the kernel density estimate of the samples

$$\hat V_2(x_j) \triangleq \frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt2}(x_j - x_i)$$

and the Information Potential is the total potential

$$\hat V_2(X) = \frac{1}{N}\sum_{j=1}^{N}\hat V_2(x_j)$$

Now if there is a potential there are forces in the space of the samples, which can be easily computed as

$$\frac{\partial}{\partial x_j}\hat V_2(x_j) = \frac{1}{N}\sum_{i=1}^{N} G'_{\sigma\sqrt2}(x_j - x_i) = \frac{1}{2N\sigma^2}\sum_{i=1}^{N} G_{\sigma\sqrt2}(x_j - x_i)\,(x_i - x_j)$$

$$\hat F(x_j; x_i) \triangleq \frac{1}{N}\,G'_{\sigma\sqrt2}(x_j - x_i) \qquad \hat F(x_j) \triangleq \frac{\partial}{\partial x_j}\hat V_2(x_j) = \sum_{i=1}^{N}\hat F(x_j; x_i)$$
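The forces are just as easy to compute as the potential. A minimal Python sketch (ours), assuming the Gaussian kernel used above:

```python
import numpy as np

def information_forces(x, sigma):
    """F_hat(x_j) = (1/(2 N sigma^2)) sum_i G_{sigma*sqrt2}(x_j - x_i)(x_i - x_j)."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]          # diff[j, i] = x_j - x_i
    s2 = 2.0 * sigma ** 2
    g = np.exp(-diff ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    # dG_{sigma*sqrt2}(u)/du = -u/(2 sigma^2) G_{sigma*sqrt2}(u), averaged over i
    return (g * (-diff) / (2 * sigma ** 2)).mean(axis=1)

rng = np.random.default_rng(3)
x = rng.uniform(size=8)
print(information_forces(x, sigma=0.1))   # net pull of each sample toward the others
```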
[Figure: interaction between one information particle in the center of 2-D space and the surrounding samples.]

Another way to look at the interactions among samples is to create two matrices in the space

$$D = \{d_{ij}\},\ d_{ij} = x_i - x_j \qquad\qquad \varsigma = \{\hat V_{ij}\},\ \hat V_{ij} = G_{\sigma\sqrt2}(d_{ij})$$

with the fields

$$\hat V(i) = \frac{1}{N}\sum_{j=1}^{N}\hat V_{ij} \qquad \hat F(i) = \frac{1}{2N\sigma^2}\sum_{j=1}^{N}\hat V_{ij}\,d_{ij} \qquad \hat V(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\hat V_{ij}$$

D is a matrix of distances and ς is a matrix of scalars. As we can see, the samples create a metric space given by the kernel.
[Figure: example of forces pulling apart samples in the unit square when the entropy is maximized. The lines show the direction of the force and their size is the intensity of the pull.]
Extension to any α and kernel

The framework of information potential and forces extends to any Parzen kernel, where the shape of the kernel creates the interaction field. Changing α also changes the field. The α information potential field is

$$\hat V_\alpha(x_j) \triangleq \left(\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1} \qquad \hat V_\alpha(X) = \frac{1}{N}\sum_{j=1}^{N}\hat V_\alpha(x_j)$$

and it can be expressed as a function of the IP as

$$\hat V_\alpha(x_j) = \left(\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-2}\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i) = \hat p_X(x_j)^{\alpha-2}\,\hat V_2(x_j)$$

The α information force is likewise given by

$$\hat F_\alpha(x_j) \triangleq \frac{\partial}{\partial x_j}\hat V_\alpha(x_j) = (\alpha-1)\left(\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-2}\frac{1}{N}\sum_{i=1}^{N}\kappa'_\sigma(x_j - x_i) = (\alpha-1)\,\hat p_X(x_j)^{\alpha-2}\,\hat F_2(x_j)$$

so conceptually all fields can be composed from the IP. α < 2 emphasizes regions of lower density of samples (the opposite for α > 2).
Divergence Measures

Renyi proposed a divergence measure from his entropy definition, which is different from the Kullback-Leibler divergence discussed in ch 1.

Renyi's divergence:

$$D_\alpha(f\|g) \triangleq \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty} f(x)\left(\frac{f(x)}{g(x)}\right)^{\alpha-1} dx$$

with properties

i. $D_\alpha(f\|g) \ge 0,\ \forall f, g,\ \alpha > 0$
ii. $D_\alpha(f\|g) = 0$ iff $f(x) = g(x)\ \forall x\in\mathbb{R}$
iii. $\lim_{\alpha\to1} D_\alpha(f\|g) = D_{KL}(f\|g)$

Writing the divergence as an expectation over f and approximating it by the empirical mean over Parzen estimates gives the plug-in estimator

$$D_\alpha(f\|g) = \frac{1}{\alpha-1}\log E_f\left[\left(\frac{f(x)}{g(x)}\right)^{\alpha-1}\right] \approx \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma\!\left(x_j^f - x_i^f\right)}{\frac{1}{M}\sum_{i=1}^{M}\kappa_\sigma\!\left(x_j^f - x_i^g\right)}\right)^{\alpha-1} \triangleq \hat D_\alpha(f\|g)$$

where $\{x_i^f\}$ are the N samples from f and $\{x_i^g\}$ the M samples from g.
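A plug-in sketch of this estimator in Python (our own illustration, using Gaussian Parzen estimates for both densities):

```python
import numpy as np

def kde_at(points, samples, sigma):
    d2 = (np.asarray(points)[:, None] - np.asarray(samples)[None, :]) ** 2
    return (np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)).mean(axis=1)

def renyi_divergence_hat(xf, xg, sigma, alpha):
    """D_alpha_hat = 1/(alpha-1) log (1/N) sum_j (f_hat(x_j)/g_hat(x_j))^(alpha-1),
    where the x_j are the samples drawn from f."""
    f_j = kde_at(xf, xf, sigma)   # f_hat at the f samples
    g_j = kde_at(xf, xg, sigma)   # g_hat at the f samples
    return np.log(np.mean((f_j / g_j) ** (alpha - 1))) / (alpha - 1)

rng = np.random.default_rng(4)
xf = rng.normal(0.0, 1.0, 300)
xg = rng.normal(0.5, 1.0, 300)
print(renyi_divergence_hat(xf, xg, sigma=0.4, alpha=2.0))   # > 0
print(renyi_divergence_hat(xf, xf, sigma=0.4, alpha=2.0))   # exactly 0 when f = g
```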
Renyi's Mutual Information

Renyi's mutual information is also the divergence between the joint and the product of the marginals

$$I_\alpha(X) \triangleq \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\frac{p_X(x^1,\ldots,x^n)^\alpha}{\left(\prod_{o=1}^{n} p_{X^o}(x^o)\right)^{\alpha-1}}\,dx^1\cdots dx^n$$

and we can come up with a kernel estimate for it if we approximate the E[.] by the empirical mean

$$\hat I_\alpha(X) = \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\frac{1}{N}\sum_{i=1}^{N}\prod_{o=1}^{n}\kappa_\sigma\!\left(x_j^o - x_i^o\right)}{\prod_{o=1}^{n}\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma\!\left(x_j^o - x_i^o\right)}\right)^{\alpha-1}$$

which also approximates the KL estimator for α → 1. However, Renyi's divergence and MI are not as general as KL, because Shannon's measure of information is the only one for which the increase in information is equal to the negative of the decrease in uncertainty.
Quadratic Divergence and Mutual Information

Measuring dissimilarity in probability space is a complex issue and there are many other divergence measures that can be used. If we transform the simplex to a hypersphere preserving the Fisher information, the coordinates of each PMF become $\sqrt{p_k}$. The geodesic distance between PMFs f and g becomes

$$\cos D_G = \sum_k\sqrt{f_k\,g_k}$$

This is related to the Bhattacharyya distance (which is Renyi's divergence with α = 1/2)

$$D_B(f,g) = -\ln\int\sqrt{f(x)g(x)}\,dx$$

Now if we measure the distance between f and g in a linear projection space (the chordal distance) we get

$$D_H = \sqrt{\sum_k\left(\sqrt{f_k} - \sqrt{g_k}\right)^2}$$

This resembles the Hellinger distance, which is also related to Renyi's divergence:

$$D_H(f,g) = \left[\int\left(\sqrt{f(x)} - \sqrt{g(x)}\right)^2 dx\right]^{1/2} = \left[2\left(1 - \int\sqrt{f(x)g(x)}\,dx\right)\right]^{1/2}$$

Since we have an estimator for quadratic Renyi's entropy (the IP), we will use α = 2 instead of α = 1/2 and define two quadratic divergences, called the Euclidean distance and the Cauchy-Schwarz divergence between f and g.
Euclidean distance between PDFs

We will drop the square root and define the distance as

$$D_{ED}(f,g) = \int\left(f(x) - g(x)\right)^2 dx = \int f(x)^2\,dx - 2\int f(x)g(x)\,dx + \int g(x)^2\,dx$$

Notice that each of the three terms can be estimated with the IP. The middle term involves samples from the two PDFs and so will be called the Cross Information Potential (CIP); it measures the potential created by one PDF at the locations specified by the samples of the other PDF.

We will define the quadratic mutual information QMIED between two random variables as

$$I_{ED}(X_1, X_2) = D_{ED}\left(f_{X_1X_2}(x_1,x_2),\ f_{X_1}(x_1)f_{X_2}(x_2)\right)$$

Notice that IED is zero if the two r.v. are independent. The extension to multidimensional random variables is also possible:

$$I_{ED}(X_1,\ldots,X_k) = D_{ED}\left(f_{X_1\cdots X_k}(x_1,\ldots,x_k),\ \prod_{i=1}^{k} f_{X_i}(x_i)\right)$$

Although a distance, we sometimes refer to DED as a divergence.
Cauchy-Schwarz divergence

The other divergence is related to the Bhattacharyya distance but can be formally derived from the Cauchy-Schwarz inequality

$$\int f(x)^2\,dx\int g(x)^2\,dx \ge \left(\int f(x)g(x)\,dx\right)^2$$

where the equality holds iff f(x) = g(x). The divergence is defined as

$$D_{CS}(f,g) = -\log\frac{\int f(x)g(x)\,dx}{\sqrt{\int f(x)^2\,dx\int g(x)^2\,dx}}$$

DCS is always greater than or equal to zero, it is zero for f(x) = g(x), and it is symmetric (but does not obey the triangle inequality). We will also work with the square of this quantity in practice. We can also write

$$D_{CS}(f,g) = \log\int f(x)^2\,dx + \log\int g(x)^2\,dx - 2\log\int f(x)g(x)\,dx$$

We can also define the quadratic mutual information QMICS as

$$I_{CS}(X_1,X_2) = D_{CS}\left(f_{X_1X_2}(x_1,x_2),\ f_{X_1}(x_1)f_{X_2}(x_2)\right)$$

and as before for independent variables it is zero. It can also be extended to multiple variables easily:

$$I_{CS}(X_1,\ldots,X_k) = D_{CS}\left(f_{X_1\cdots X_k}(x_1,\ldots,x_k),\ \prod_{i=1}^{k} f_{X_i}(x_i)\right)$$
Relation between DCS and Renyi's relative entropy

There is a very interesting interpretation of the DCS in terms of Renyi's entropy. Lutwak defines Renyi's relative entropy as

$$D_\alpha(f,g) = \log\frac{\left(\int_R f(x)\,g(x)^{\alpha-1}\,dx\right)^{1/(1-\alpha)}\left(\int_R g(x)^\alpha\,dx\right)^{1/\alpha}}{\left(\int_R f(x)^\alpha\,dx\right)^{1/(\alpha(1-\alpha))}}$$

For α = 2 this gives exactly DCS, so

$$D_{CS}(X,Y) = -\log\int f(x)g(x)\,dx + \frac{1}{2}\log\int f(x)^2\,dx + \frac{1}{2}\log\int g(x)^2\,dx = H_2(X;Y) - \frac{1}{2}H_2(X) - \frac{1}{2}H_2(Y)$$

where the first term, $H_2(X;Y) = -\log\int f(x)g(x)\,dx$, can be shown to be Renyi's quadratic cross-entropy. There is a striking similarity between DCS and Shannon's mutual information, but notice that here we are dealing with Renyi's quadratic entropy.
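As a check on the exponents in Lutwak's definition as reconstructed above, substituting α = 2 recovers DCS directly:

```latex
D_2(f,g)
  = \log\frac{\left(\int f(x)g(x)\,dx\right)^{-1}\left(\int g(x)^2\,dx\right)^{1/2}}
             {\left(\int f(x)^2\,dx\right)^{-1/2}}
  = \log\frac{\sqrt{\int f(x)^2\,dx\;\int g(x)^2\,dx}}{\int f(x)g(x)\,dx}
  = D_{CS}(f,g)
```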
Geometric Interpretation of Quadratic Mutual Information

Let us define

$$V_J = \iint f_{X_1X_2}(x_1,x_2)^2\,dx_1\,dx_2$$
$$V_M = \iint \left(f_{X_1}(x_1)\,f_{X_2}(x_2)\right)^2 dx_1\,dx_2$$
$$V_c = \iint f_{X_1X_2}(x_1,x_2)\,f_{X_1}(x_1)\,f_{X_2}(x_2)\,dx_1\,dx_2$$

therefore QMIED and QMICS can be rewritten as

$$I_{ED} = V_J - 2V_c + V_M \qquad\qquad I_{CS} = \log V_J - 2\log V_c + \log V_M$$

The figure illustrates these distances in the simplex.

[Figure: in the simplex, the joint PMF $f_{X_1X_2}(x_1,x_2)$ and the factorized PMF $f_{X_1}(x_1)f_{X_2}(x_2)$ are two points; IED is the Euclidean distance between them, Is is the K-L divergence, and with $\cos\theta = V_c/\sqrt{V_J V_M}$ the Cauchy-Schwarz mutual information is $I_{CS} = -\log(\cos^2\theta)$.]
Example

[Figure: a 2×2 joint PMF for X1 and X2, with entries $P_{X_1X_2}(i,j)$ and marginals $P_{X_1}(i)$ and $P_{X_2}(j)$.]

For discrete variables the quadratic mutual informations become

$$I_{ED}(X_1,X_2) = \sum_{i=1}^{n}\sum_{j=1}^{m}\left(P_{X_1X_2}(i,j) - P_{X_1}(i)P_{X_2}(j)\right)^2$$

$$I_{CS}(X_1,X_2) = \log\frac{\left(\sum_{i=1}^{n}\sum_{j=1}^{m}P_{X_1X_2}(i,j)^2\right)\left(\sum_{i=1}^{n}\sum_{j=1}^{m}\left(P_{X_1}(i)P_{X_2}(j)\right)^2\right)}{\left(\sum_{i=1}^{n}\sum_{j=1}^{m}P_{X_1X_2}(i,j)\,P_{X_1}(i)P_{X_2}(j)\right)^2}$$

Fix $P_{X_1} = (0.6, 0.4)$ and vary $P(1,1)\in[0, 0.6]$ and $P(2,1)\in[0, 0.4]$.

[Figure: surfaces of Is (Shannon MI), IED and ICS as functions of $P_{X_1X_2}(1,1)$ and $P_{X_1X_2}(2,1)$.]
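The example is easy to reproduce. A minimal Python sketch (ours) of the two discrete QMIs; discrete_qmi is an illustrative name:

```python
import numpy as np

def discrete_qmi(P):
    """I_ED and I_CS for a discrete joint PMF P (n x m), per the sums above."""
    P = np.asarray(P, dtype=float)
    M = np.outer(P.sum(axis=1), P.sum(axis=0))   # product of the marginals
    i_ed = np.sum((P - M) ** 2)
    i_cs = np.log(np.sum(P ** 2) * np.sum(M ** 2) / np.sum(P * M) ** 2)
    return i_ed, i_cs

# fix P_X1 = (0.6, 0.4); vary P(1,1) in [0, 0.6] and P(2,1) in [0, 0.4]
for p11, p21 in [(0.30, 0.20), (0.60, 0.00), (0.00, 0.40)]:
    P = np.array([[p11, 0.6 - p11],
                  [p21, 0.4 - p21]])
    print(p11, p21, discrete_qmi(P))  # both zero at the independent point (0.30, 0.20)
```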
Information Potential and Forces in the Joint Space

The important lesson to be learned from the IP framework and the distance measures defined is that each PDF creates its own potential that interacts with the others. This is an additive process, so the forces are also additive, and it becomes relatively simple.

Euclidean and Cauchy-Schwarz divergences

The three potentials needed are

$$\hat V_f = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_f(i) - x_f(j)\right)$$
$$\hat V_g = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_g(i) - x_g(j)\right)$$
$$\hat V_c = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_f(i) - x_g(j)\right)$$

and the divergences and forces are written as

$$\hat D_{ED}(f,g) = \hat V_f - 2\hat V_c + \hat V_g \qquad\qquad \hat D_{CS}(f,g) = \log\frac{\hat V_f\,\hat V_g}{\hat V_c^2}$$

$$\frac{\partial\hat D_{ED}}{\partial x_i} = \frac{\partial\hat V_f}{\partial x_i} + \frac{\partial\hat V_g}{\partial x_i} - 2\frac{\partial\hat V_c}{\partial x_i} \qquad \frac{\partial\hat D_{CS}}{\partial x_i} = \frac{1}{\hat V_f}\frac{\partial\hat V_f}{\partial x_i} + \frac{1}{\hat V_g}\frac{\partial\hat V_g}{\partial x_i} - \frac{2}{\hat V_c}\frac{\partial\hat V_c}{\partial x_i}$$

(if $x_i$ is a sample from f only, the $\hat V_g$ term is zero).
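A minimal Python sketch (ours) of both divergence estimators built from the three potentials; cross_ip computes the (cross) information potential between two sample sets:

```python
import numpy as np

def cross_ip(a, b, sigma):
    """CIP: (1/(Na Nb)) sum_ij G_{sigma*sqrt2}(a_i - b_j); a = b gives the IP."""
    d2 = (np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]) ** 2
    s2 = 2.0 * sigma ** 2
    return (np.exp(-d2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)).mean()

def d_ed(xf, xg, sigma):
    return cross_ip(xf, xf, sigma) - 2 * cross_ip(xf, xg, sigma) + cross_ip(xg, xg, sigma)

def d_cs(xf, xg, sigma):
    return np.log(cross_ip(xf, xf, sigma) * cross_ip(xg, xg, sigma)
                  / cross_ip(xf, xg, sigma) ** 2)

rng = np.random.default_rng(5)
xf = rng.normal(0.0, 1.0, 200)
xg = rng.normal(1.0, 1.0, 200)
print(d_ed(xf, xg, sigma=0.5), d_cs(xf, xg, sigma=0.5))   # both > 0
print(d_ed(xf, xf, sigma=0.5), d_cs(xf, xf, sigma=0.5))   # both 0 for f = g
```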
Quadratic Mutual Information

The QMIs are a bit more detailed because now we have to deal with the joint, the marginals and their product. But everything is still additive. The three PDFs are estimated as

$$\hat f_{X_1X_2}(x_1,x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\left(x_1 - x_1(i)\right)G_\sigma\left(x_2 - x_2(i)\right)$$
$$\hat f_{X_1}(x_1) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\left(x_1 - x_1(i)\right)$$
$$\hat f_{X_2}(x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\left(x_2 - x_2(i)\right)$$

and they create the following fields:

$$\hat V_J = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_1(i) - x_1(j)\right)G_{\sigma\sqrt2}\left(x_2(i) - x_2(j)\right)$$

$$\hat V_M = \hat V_1\,\hat V_2 \quad\text{with}\quad \hat V_k = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_k(i) - x_k(j)\right),\ k = 1,2$$

$$\hat V_C = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_1(i) - x_1(j)\right)\right)\left(\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_2(i) - x_2(j)\right)\right)$$
Let us exemplify for VC:

$$\begin{aligned}
\hat V_C &= \iint \hat f_{X_1X_2}(x_1,x_2)\,\hat f_{X_1}(x_1)\,\hat f_{X_2}(x_2)\,dx_1\,dx_2 \\
&= \frac{1}{N^3}\sum_{k=1}^{N}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_1(k) - x_1(i)\right)G_{\sigma\sqrt2}\left(x_2(k) - x_2(j)\right) \\
&= \frac{1}{N}\sum_{k=1}^{N}\left[\frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt2}\left(x_1(k) - x_1(i)\right)\right]\left[\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt2}\left(x_2(k) - x_2(j)\right)\right]
\end{aligned}$$

Notice that VC, written as the triple sum, has complexity O(N3). Finally we have

$$\hat I_{ED}(X_1,X_2) = \hat V_J - 2\hat V_C + \hat V_M \qquad\qquad \hat I_{CS}(X_1,X_2) = \log\frac{\hat V_J\,\hat V_M}{\hat V_C^2}$$

with the fields as defined above.
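Putting the three fields together, here is a Python sketch (ours) of the two QMI estimators for a pair of scalar variables, using the factored O(N2) form of V̂C:

```python
import numpy as np

def _gram(a, sigma):
    """Pairwise matrix G_{sigma*sqrt2}(a_i - a_j)."""
    a = np.asarray(a, dtype=float)
    d2 = (a[:, None] - a[None, :]) ** 2
    s2 = 2.0 * sigma ** 2
    return np.exp(-d2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def qmi(x1, x2, sigma):
    """I_ED_hat = V_J - 2 V_C + V_M and I_CS_hat = log(V_J V_M / V_C^2)."""
    g1, g2 = _gram(x1, sigma), _gram(x2, sigma)
    v_j = (g1 * g2).mean()                            # joint field
    v_c = (g1.mean(axis=1) * g2.mean(axis=1)).mean()  # CIP, factored O(N^2) form
    v_m = g1.mean() * g2.mean()                       # product of marginal fields
    return v_j - 2 * v_c + v_m, np.log(v_j * v_m / v_c ** 2)

rng = np.random.default_rng(6)
x1 = rng.normal(size=300)
print(qmi(x1, x1 + 0.2 * rng.normal(size=300), sigma=0.5))  # dependent: both > 0
print(qmi(x1, rng.normal(size=300), sigma=0.5))             # ~independent: near 0
```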
The interactions are always based on the marginals, but there are three levels: the level of the individual marginal samples (joint space), the level of the marginal sample and marginal fields (the CIP) and the product of the two marginal fields. This field can be generalized to any number of variables:

$$\hat I_{ED}(X_1,\ldots,X_K) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat V_k(i,j) - \frac{2}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\left(\frac{1}{N}\sum_{j=1}^{N}\hat V_k(i,j)\right) + \prod_{k=1}^{K}\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\hat V_k(i,j)\right)$$

$$\hat I_{CS}(X_1,\ldots,X_K) = \log\frac{\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat V_k(i,j)\right)\prod_{k=1}^{K}\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\hat V_k(i,j)\right)}{\left(\frac{1}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\left(\frac{1}{N}\sum_{j=1}^{N}\hat V_k(i,j)\right)\right)^2}$$

with $\hat V_k(i,j) = G_{\sigma\sqrt2}\left(x_k(i) - x_k(j)\right)$.

The information forces for each field are also easily derived:

$$\hat F_{ED}(i) = \frac{\partial\hat I_{ED}}{\partial x_k(i)} = \frac{\partial\hat V_J}{\partial x_k(i)} - 2\frac{\partial\hat V_C}{\partial x_k(i)} + \frac{\partial\hat V_M}{\partial x_k(i)}$$
$$\hat F_{CS}(i) = \frac{\partial\hat I_{CS}}{\partial x_k(i)} = \frac{1}{\hat V_J}\frac{\partial\hat V_J}{\partial x_k(i)} - \frac{2}{\hat V_C}\frac{\partial\hat V_C}{\partial x_k(i)} + \frac{1}{\hat V_M}\frac{\partial\hat V_M}{\partial x_k(i)}$$

These expressions are going to be very useful when adapting systems with divergences or quadratic mutual information.
Fast Information and Cross Information Potential Calculations

One of the problems with this IP estimator methodology is the computational complexity, which is O(N2) or O(N3). There are two ways to get approximations to the IP and CIP with arbitrary accuracy: the Fast Gauss Transform and the Incomplete Cholesky decomposition. They both transform the complexity to O(NM) where M << N.

Fast Gauss Transform

The FGT takes advantage of the fast decay of the Gaussian function and can efficiently compute weighted sums of scalar Gaussians

$$S(y_i) = \sum_{j=1}^{N} w_j\, e^{-(y_i - y_j)^2/4\sigma^2}, \qquad i = 1,\ldots,M$$

The savings come from the shifting property of the Gaussian function, which reads

$$e^{-\left(\frac{y_j - y_i}{2\sigma}\right)^2} = e^{-\left(\frac{(y_j - y_C) - (y_i - y_C)}{2\sigma}\right)^2} = \sum_{n=0}^{\infty}\frac{1}{n!}\left(\frac{y_i - y_C}{2\sigma}\right)^n h_n\!\left(\frac{y_j - y_C}{2\sigma}\right)$$

and from truncating this Hermite series at a small order p, where the Hermite functions are

$$h_n(y) = (-1)^n\frac{d^n}{dy^n}\left(e^{-y^2}\right)$$
The shifting property is useful because there is no need to evaluate every Gaussian at every point. Instead a p-term sum is computed around a small number of cluster centers yc with O(Np) computation. These sums are then shifted to the desired yi locations and computed in another O(Mp) operations. Normally the centers are found using the furthest point algorithm, which is efficient.

The information potential calculation (M = N) can be immediately given as

$$\hat V_2(y) \approx \frac{1}{2N^2\sigma\sqrt{\pi}}\sum_{j=1}^{N}\sum_{b=1}^{B}\sum_{n=0}^{p-1}\frac{1}{n!}\,C_n(b)\,h_n\!\left(\frac{y_j - y_{c_b}}{2\sigma}\right) \quad\text{with}\quad C_n(b) = \sum_{y_i\in B_b}\left(\frac{y_i - y_{c_b}}{2\sigma}\right)^n$$

which requires O(NpB) calculations. p is normally 4 or 5 (independent of N), and B, the number of clusters, is also relatively small (~10), but with a weak dependence on N.

The problem with this algorithm arises if the data is multidimensional, because the complexity p changes to pD where D is the dimension. In order to cope with high dimensions, a vector Taylor series has been proposed to approximate the high dimensional Gaussian as
$$e^{-\frac{\lVert y_j - y_i\rVert^2}{4\sigma^2}} = e^{-\frac{\lVert y_j - y_c\rVert^2}{4\sigma^2}}\; e^{-\frac{\lVert y_i - y_c\rVert^2}{4\sigma^2}}\; e^{\frac{2\,(y_j - y_c)\cdot(y_i - y_c)}{4\sigma^2}}$$

and the cross term is approximated by a Taylor expansion as

$$e^{\frac{2\,(y_j - y_c)\cdot(y_i - y_c)}{4\sigma^2}} = \sum_{|\alpha|\ge0}\frac{2^{|\alpha|}}{\alpha!}\left(\frac{y_j - y_c}{2\sigma}\right)^{\alpha}\left(\frac{y_i - y_c}{2\sigma}\right)^{\alpha} + \varepsilon(\alpha)$$

where α is a multi-index, $\alpha! = \alpha_1!\,\alpha_2!\cdots\alpha_d!$ and $|\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_d$. Now the IP for multidimensions becomes

$$\hat V_2 \approx \frac{1}{N^2\left(4\pi\sigma^2\right)^{d/2}}\sum_{j=1}^{N}\sum_{B}\sum_{|\alpha|\ge0} C_\alpha(B)\, e^{-\frac{\lVert y_j - y_{c_B}\rVert^2}{4\sigma^2}}\left(\frac{y_j - y_{c_B}}{2\sigma}\right)^{\alpha}$$

$$C_\alpha(B) = \frac{2^{|\alpha|}}{\alpha!}\sum_{y_i\in B}e^{-\frac{\lVert y_i - y_{c_B}\rVert^2}{4\sigma^2}}\left(\frac{y_i - y_{c_B}}{2\sigma}\right)^{\alpha}$$

For a D dimensional data set the calculation is O(NBr_{p,D}) instead of O(NBp^D), with

$$r_{p,D} = \binom{p+D}{D} = \frac{(p+D)!}{D!\,p!}$$

but normally p must be larger than before for the same precision.
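To make the mechanics concrete, here is a simplified 1-D Python sketch (ours) of the Hermite expansion behind the FGT, using a single expansion center instead of the farthest-point clusters of the full algorithm:

```python
import math
import numpy as np

def hermite_funcs(t, p):
    """h_0..h_{p-1} at points t via h_{n+1}(t) = 2 t h_n(t) - 2 n h_{n-1}(t)."""
    h = np.empty((p, len(t)))
    h[0] = np.exp(-t ** 2)
    if p > 1:
        h[1] = 2 * t * h[0]
    for n in range(1, p - 1):
        h[n + 1] = 2 * t * h[n] - 2 * n * h[n - 1]
    return h

def fgt_sums(y, sigma, p=12):
    """S(y_i) ~= sum_n C_n h_n(t_i), t = (y - c)/(2 sigma), one center c (the mean);
    the full FGT uses several centers, but one shows the idea for narrow data."""
    y = np.asarray(y, dtype=float)
    t = (y - y.mean()) / (2.0 * sigma)
    C = np.array([(t ** n).sum() / math.factorial(n) for n in range(p)])  # O(N p)
    return C @ hermite_funcs(t, p)                                        # O(N p)

rng = np.random.default_rng(7)
y = rng.normal(0.0, 0.2, 500)
sigma = 0.5
exact = np.exp(-(y[:, None] - y[None, :]) ** 2 / (4 * sigma ** 2)).sum(axis=1)
print(np.max(np.abs(fgt_sums(y, sigma) - exact)))   # small truncation error
```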
Incomplete Cholesky Decomposition

It turns out that the matrix created by the pairwise evaluation of the Gaussian is full rank, but its eigenspectrum decays very fast. Hence we can take advantage of this fact. Any N×N symmetric positive definite matrix can be written K = GGᵀ, where G is a lower triangular matrix with positive diagonal entries. When the eigenspectrum falls fast, we can approximate K by an N×D lower triangular matrix G̃ such that

$$\left\lVert K - \tilde G\tilde G^T\right\rVert < \varepsilon$$

There are ways of computing G̃ effectively; the computational complexity of the procedure becomes O(ND2). The IP can be computed as

$$\hat V_2(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i, x_j) = \frac{1}{N^2}\mathbf{1}_N^T K_{XX}\mathbf{1}_N = \frac{1}{N^2}\left\lVert\tilde G_{XX}^T\mathbf{1}_N\right\rVert^2$$

and the QMIs as

$$\hat I_{ED} = \frac{1}{N^2}\mathbf{1}_N^T\left(\tilde G_{XX}\tilde G_{XX}^T \circ \tilde G_{YY}\tilde G_{YY}^T\right)\mathbf{1}_N - \frac{2}{N^3}\mathbf{1}_N^T\tilde G_{XX}\tilde G_{XX}^T\tilde G_{YY}\tilde G_{YY}^T\mathbf{1}_N + \frac{1}{N^4}\left(\mathbf{1}_N^T\tilde G_{XX}\tilde G_{XX}^T\mathbf{1}_N\right)\left(\mathbf{1}_N^T\tilde G_{YY}\tilde G_{YY}^T\mathbf{1}_N\right)$$

$$\hat I_{CS} = \log\frac{\left(\mathbf{1}_N^T\left(\tilde G_{XX}\tilde G_{XX}^T \circ \tilde G_{YY}\tilde G_{YY}^T\right)\mathbf{1}_N\right)\left(\mathbf{1}_N^T\tilde G_{XX}\tilde G_{XX}^T\mathbf{1}_N\right)\left(\mathbf{1}_N^T\tilde G_{YY}\tilde G_{YY}^T\mathbf{1}_N\right)}{\left(\mathbf{1}_N^T\tilde G_{XX}\tilde G_{XX}^T\tilde G_{YY}\tilde G_{YY}^T\mathbf{1}_N\right)^2}$$

where ∘ denotes the Hadamard (elementwise) product.
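A minimal Python sketch (ours) of pivoted incomplete Cholesky for the Gaussian Gram matrix, never forming the full N×N matrix, followed by the IP computed as above:

```python
import numpy as np

def incomplete_cholesky(x, sigma, tol=1e-8):
    """Pivoted incomplete Cholesky of K_ij = G_{sigma*sqrt2}(x_i - x_j),
    returning G_tilde (N x D, D << N) with K ~= G_tilde @ G_tilde.T."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s2 = 2.0 * sigma ** 2                    # variance of G_{sigma*sqrt2}
    k0 = 1.0 / np.sqrt(2 * np.pi * s2)       # K_ii, the kernel at zero
    diag = np.full(n, k0)                    # residual diagonal of K - G G^T
    G = np.zeros((n, 0))
    while diag.sum() > tol:
        i = int(np.argmax(diag))             # pivot on the largest residual
        k_i = np.exp(-(x - x[i]) ** 2 / (2 * s2)) * k0   # column K[:, i]
        g = (k_i - G @ G[i]) / np.sqrt(diag[i])
        G = np.column_stack([G, g])
        diag = np.maximum(diag - g ** 2, 0.0)
    return G

rng = np.random.default_rng(8)
x = rng.normal(size=400)
G = incomplete_cholesky(x, sigma=0.7)
one = np.ones(len(x))
v2 = np.sum((G.T @ one) ** 2) / len(x) ** 2   # V2_hat = (1/N^2) 1^T K 1
print(G.shape, v2, -np.log(v2))               # rank D, the IP, quadratic entropy
```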
However, the CIP Gram matrix is not a positive definite matrix, so we need to extend KXX to create a positive definite matrix

$$K_{ZZ} = \begin{bmatrix} K_{XX} & K_{XY} \\ K_{XY}^T & K_{YY} \end{bmatrix}$$

where KXY denotes the Gram matrix for the CIP. With the indicator vectors

$$e_1 = \{\underbrace{1,\ldots,1}_{N},\underbrace{0,\ldots,0}_{N}\}^T \qquad e_2 = \{\underbrace{0,\ldots,0}_{N},\underbrace{1,\ldots,1}_{N}\}^T$$

the calculation of the CIP becomes

$$\hat V(X;Y) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i, y_j) = \frac{1}{N^2}\,e_1^T K_{ZZ}\,e_2 = \frac{1}{N^2}\left(\tilde G_{ZZ}^T e_1\right)^T\left(\tilde G_{ZZ}^T e_2\right)$$

with complexity O(ND2). The divergence measures can be computed efficiently as

$$\hat D_{ED} = \frac{1}{N^2}\left(e_1^T\tilde G_{ZZ}\right)\left(\tilde G_{ZZ}^T e_1\right) - \frac{2}{N^2}\left(e_1^T\tilde G_{ZZ}\right)\left(\tilde G_{ZZ}^T e_2\right) + \frac{1}{N^2}\left(e_2^T\tilde G_{ZZ}\right)\left(\tilde G_{ZZ}^T e_2\right)$$

$$\hat D_{CS} = \log\frac{\left(\left(e_1^T\tilde G_{ZZ}\right)\left(\tilde G_{ZZ}^T e_1\right)\right)\left(\left(e_2^T\tilde G_{ZZ}\right)\left(\tilde G_{ZZ}^T e_2\right)\right)}{\left(\left(e_1^T\tilde G_{ZZ}\right)\left(\tilde G_{ZZ}^T e_2\right)\right)^2}$$