JOSE C. PRINCIPE, U. OF FLORIDA, EEL 6935, 352-392-2662, [email protected], copyright 2009

Renyi's entropy

History: Alfred Renyi was looking for the most general definition of information measures that would preserve the additivity for independent events and be compatible with the axioms of probability. He started with Cauchy's functional equation: if p and q are independent, then I(pq) = I(p) + I(q). Apart from a normalizing constant this is compatible with Hartley's information content I(p) = -\log p. If we assume that the events X = \{x_1, \dots, x_N\} have different probabilities \{p_1, \dots, p_N\}, and each delivers I_k bits of information, then the total amount of information for the set is

I(P) = \sum_{k=1}^{N} p_k I_k

This can be recognized as Shannon's entropy. But he reasoned that there is an implicit assumption used in this equation: we use the linear average, which is not the only one that can be used. In the general theory of means, for any function g(x) with inverse g^{-1}, the mean can be computed as

g^{-1}\left( \sum_{k=1}^{N} p_k\, g(x_k) \right)

Applying this definition to I(P) we get

I(P) = g^{-1}\left( \sum_{k=1}^{N} p_k\, g(I_k) \right)

When the postulate of additivity for independent events is applied we get just two possible g(x):

g(x) = cx \qquad\text{and}\qquad g(x) = c\, 2^{(1-\alpha)x}

The first form gives Shannon information and the second gives

I_\alpha(P) = \frac{1}{1-\alpha}\log\left( \sum_{k=1}^{N} p_k^{\alpha} \right)

for non-negative \alpha different from 1. This gives a parametric family of information measures that are called today Renyi's entropies.

It can be shown that Shannon is a special case when \alpha \to 1:

H_S(X) = \lim_{\alpha\to 1} H_\alpha(X) = \lim_{\alpha\to 1} \frac{\log\left( \sum_{k=1}^{N} p_k^{\alpha} \right)}{1-\alpha} = -\sum_{k=1}^{N} p_k \log p_k

(the limit follows from L'Hopital's rule). So Renyi's entropies contain Shannon as a special case.
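The closed form above is easy to check numerically. Below is a minimal sketch (NumPy, natural logarithms; the function names are mine, not from the course material) that evaluates H_alpha for a small PMF and confirms that alpha near 1 approaches Shannon's entropy.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha of a discrete PMF p (natural log, alpha > 0, alpha != 1)."""
    p = np.asarray(p, float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    """Shannon entropy, the alpha -> 1 limit of H_alpha."""
    p = np.asarray(p, float)
    return -np.sum(p * np.log(p))

if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    for a in (0.5, 0.999, 2.0, 4.0):
        print(a, renyi_entropy(p, a))
    print("Shannon:", shannon_entropy(p))   # close to H_alpha at alpha = 0.999
```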

Meaning of \alpha

If we compare the two definitions we see that instead of weighting the \log p_k by the probabilities, here the log is external to the sum and its argument is the \alpha power of the PMF, i.e.

V_\alpha(X) = \sum_{k} p_k^{\alpha}

H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha(X) = -\log\left( V_\alpha(X)^{1/(\alpha-1)} \right) = -\log\left( \left( E\left[ p^{\alpha-1}(X) \right] \right)^{1/(\alpha-1)} \right)

Note that V_\alpha(X) is the argument of the \alpha norm of the PMF.

A geometric view will help here. All the PMFs of an N-outcome r.v. exist on what is called the simplex, in an N dimensional space.

[Figure: the simplex of PMFs, p = (p_1, p_2) in 2-D and p = (p_1, p_2, p_3) in 3-D, together with \left( \sum_{k=1}^{n} p_k^{\alpha} \right)^{1/\alpha}, the \alpha-norm of p; V_\alpha is the \alpha-norm of p raised to the power \alpha.]

The norm measures exactly the distance of the PMF to the origin, and \alpha is the designated l-norm (Euclidean norm is l = 2). Changing \alpha changes the metric distance in the simplex.

From another perspective, the \alpha root of V_\alpha(X) is the \alpha-norm of the PMF. So we conclude that Renyi's entropy is more flexible than Shannon and includes Shannon as a special case.

We will be using extensively \alpha = 2.

[Figure: entropy contours over the simplex with vertices [1 0 0], [0 1 0], [0 0 1] for \alpha = 0.2, 0.5, 1, 2, 4 and 20.]

Properties of Renyi's entropies

Renyi's quadratic entropy

We will be using heavily Renyi's entropy with \alpha = 2, called the quadratic entropy

H_2(X) = -\log\sum_{k} p_k^2

It has been used in physics, in signal processing and in economics.

We want also to stress that the argument of the log, which is the 2-norm of the PMF, V_2(X), has meaning in itself. In fact it is E[p(X)]: if we make the change in variables \xi_k = p(x_k), then it is the mean of the transformed variable when the PMF is the transformation.

For us Renyi's quadratic entropy is appealing because we have found an easy way to estimate it directly from samples.

[Figure: a PDF p(x), indicating the mean of X and the mean of p(X).]

Extensions to continuous variables

It is possible to show that Renyi's entropy measure extends to continuous r.v. and reads

H_\alpha(X) = \lim_{n\to\infty}\left( I_\alpha(P_n) - \log n \right) = \frac{1}{1-\alpha}\log\int p^{\alpha}(x)\,dx

and for the quadratic entropy

H_2(X) = -\log\int p^2(x)\,dx

Notice however that now entropy is no longer positive; in fact it can become arbitrarily large negative.

Estimation of Renyi's quadratic entropy

We will use here ideas from density estimation called kernel density estimation. This is an old method (Rosenblatt) that is now called Parzen density estimation because Parzen proved many important statistical properties of the estimator.

The idea is very simple: place a kernel over the samples and sum with proper normalization, i.e.

\hat p_X(x) = \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x - x_i)

The kernel has the following properties:

1. \sup_{\mathbb{R}} \kappa < \infty
2. \int_{\mathbb{R}} \kappa < \infty
3. \lim_{x\to\infty} x\,\kappa(x) = 0
4. \kappa(x) \ge 0, \quad \int_{\mathbb{R}} \kappa(x)\,dx = 1

Parzen proved that the estimator is asymptotically unbiased and consistent with a good efficiency. The free parameter \sigma is called the kernel size. Normally a Gaussian is used and \sigma becomes the standard deviation.
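As an illustration of the Parzen construction, here is a small sketch assuming a Gaussian kernel (NumPy; the function and variable names are mine):

```python
import numpy as np

def parzen_estimate(x_grid, samples, sigma):
    """Parzen (kernel) density estimate with a Gaussian kernel of size sigma."""
    d = x_grid[:, None] - samples[None, :]          # all (grid point, sample) differences
    k = np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return k.mean(axis=1)                           # (1/N) sum_i kappa_sigma(x - x_i)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.standard_normal(200)
    grid = np.linspace(-4.0, 4.0, 9)
    print(parzen_estimate(grid, samples, sigma=0.4))
```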

But notice that here we are not interested in estimating the PDF, which is a function; we are interested in a single number, the 2-norm of the PDF. For a Gaussian kernel, substituting the estimator yields immediately

\hat H_2(X) = -\log\int_{-\infty}^{\infty}\left( \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i) \right)^2 dx
= -\log\left( \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\int_{-\infty}^{\infty} G_\sigma(x - x_j)\, G_\sigma(x - x_i)\,dx \right)
= -\log\left( \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i) \right)

We call this estimator for V_2(X) the Information Potential (IP).

Let us look at the derivation: we never had to compute the integral explicitly, since the integral of the product of Gaussians is the value of a Gaussian at the difference of arguments (with a larger kernel size).

Let us also look at the expression: the algorithm is O(N^2), and there is a free parameter \sigma that the user has to select from the data. Cross validation or Silverman's rule suffice for most cases:

\sigma_{opt} = \sigma_X \left( 4 N^{-1} (2d + 1)^{-1} \right)^{1/(d+4)}
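A direct O(N^2) implementation of the Information Potential and of \hat H_2(X), with Silverman's rule as the default kernel size, might look like the sketch below (NumPy; the function names are mine, and the Gaussian G_{\sigma\sqrt{2}} is hard-coded):

```python
import numpy as np

def silverman_sigma(x, d=1):
    """Silverman's rule for the kernel size (the slide's sigma_opt, d = dimension)."""
    n = len(x)
    return np.std(x) * (4.0 / ((2 * d + 1) * n)) ** (1.0 / (d + 4))

def information_potential(x, sigma):
    """V2_hat(X) = (1/N^2) sum_ij G_{sigma*sqrt(2)}(x_j - x_i)  (O(N^2) pairwise sum)."""
    d = x[:, None] - x[None, :]
    return np.mean(np.exp(-d**2 / (4 * sigma**2))) / (2 * sigma * np.sqrt(np.pi))

def quadratic_entropy(x, sigma=None):
    """H2_hat(X) = -log V2_hat(X)."""
    if sigma is None:
        sigma = silverman_sigma(x)
    return -np.log(information_potential(x, sigma))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(500)
    # true H2 of a unit Gaussian is 0.5*log(4*pi) ~ 1.266; the estimate is close but biased upward
    print(quadratic_entropy(x))
```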

Extended Estimator for Renyi's Entropy

We can come up with still another estimator for Renyi's entropy of any order

H_\alpha(X) = \frac{1}{1-\alpha}\log\int_{-\infty}^{\infty} p_X^{\alpha}(x)\,dx \;\triangleq\; \frac{1}{1-\alpha}\log E_X\left[ p_X^{\alpha-1}(X) \right]

by approximating the expected value by the empirical mean

H_\alpha(X) \approx \frac{1}{1-\alpha}\log\left( \frac{1}{N}\sum_{j=1}^{N} p_X^{\alpha-1}(x_j) \right)

to yield

\hat H_\alpha(X) = \frac{1}{1-\alpha}\log\left[ \frac{1}{N}\sum_{j=1}^{N}\left( \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x_j - x_i) \right)^{\alpha-1} \right]
= \frac{1}{1-\alpha}\log\left[ \frac{1}{N^{\alpha}}\sum_{j=1}^{N}\left( \sum_{i=1}^{N} \kappa_\sigma(x_j - x_i) \right)^{\alpha-1} \right]

We will now present a set of properties of these estimators.
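A sketch of this extended estimator for arbitrary alpha (Gaussian kernel, NumPy; names are mine). Per Property 2.1 discussed next, for alpha = 2 it differs from the IP estimator only in the kernel size used:

```python
import numpy as np

def renyi_entropy_estimator(x, alpha, sigma):
    """H_alpha_hat(X) = 1/(1-alpha) log[ (1/N^alpha) sum_j (sum_i k_sigma(x_j - x_i))^(alpha-1) ]."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)   # k_sigma(x_j - x_i)
    inner = k.sum(axis=1)                     # sum_i k_sigma(x_j - x_i), one value per j
    N = len(x)
    return np.log(np.sum(inner ** (alpha - 1)) / N**alpha) / (1.0 - alpha)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(500)
    for a in (1.01, 1.5, 2.0, 3.0):
        print(a, renyi_entropy_estimator(x, a, sigma=0.5))
```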

Properties of the Estimators

Property 2.1: The estimator of Eq. (2.18) for Renyi's quadratic entropy using Gaussian kernels only differs from the IP of Eq. (2.14) by a factor of 2 in the kernel size.

Property 2.2: For any Parzen kernel that obeys the relation

\kappa_{new}(x_i - x_j) = \int_{-\infty}^{\infty} \kappa_{old}(x - x_i)\cdot\kappa_{old}(x - x_j)\,dx \qquad (2.19)

the estimator of Renyi's quadratic entropy of Eq. (2.18) matches the estimator of Eq. (2.14) using the IP.

Property 2.3: The kernel size must be a parameter that satisfies the scaling property \kappa_{c\sigma}(x) = \kappa_\sigma(x/c)/c for any positive factor c.

Property 2.4: The entropy estimator in Eq. (2.18) is invariant to the mean of the underlying density of the samples, as is the actual entropy.

Property 2.5: The limit of Renyi's entropy as \alpha \to 1 is Shannon's entropy. The limit of the entropy estimator in Eq. (2.18) as \alpha \to 1 is Shannon's entropy estimated using Parzen windowing with the expectation approximated by the sample mean.

Properties of the Estimators

Property 2.6: In order to maintain consistency with the scaling property of the actual entropy, if the entropy of samples {x_1, ..., x_N} of a random variable X is estimated using a kernel size of \sigma, the entropy estimate of the samples {cx_1, ..., cx_N} of a random variable cX must be estimated using a kernel size of |c|\sigma.

Property 2.7: When estimating the joint entropy of an n-dimensional random vector X from its samples {x_1, ..., x_N}, use a multi-dimensional kernel that is the product of single-dimensional kernels. This way, the estimate of the joint entropy and the estimates of the marginal entropies are consistent.

Theorem 2.1: The entropy estimator in Eq. (2.18) is consistent if the Parzen windowing and the sample mean are consistent for the actual PDF of the iid samples.

Theorem 2.2: If the maximum value of the kernel \kappa_\sigma(\xi) is achieved when \xi = 0, then the minimum value of the entropy estimator in Eq. (2.18) is achieved when all samples are equal to each other, i.e., x_1 = ... = x_N = c.

Properties of the Estimators

Theorem 2.3: If the kernel function \kappa_\sigma(\cdot) is continuous, differentiable, symmetric and unimodal, then the global minimum described in Theorem 2.2 of the entropy estimator in Eq. (2.18) is smooth, i.e., it has a zero gradient and a positive semi-definite Hessian matrix.

Property 2.8: If the kernel function satisfies the conditions in Theorem 2.3, then in the limit, as the kernel size tends to infinity, the quadratic entropy estimator approaches the logarithm of a scaled and biased version of the sample variance.

Property 2.9: In the case of joint entropy estimation, if the multi-dimensional kernel function satisfies \kappa_\Sigma(\xi) = \kappa_\Sigma(R^{-1}\xi) for all orthonormal matrices R, then the entropy estimator in Eq. (2.18) is invariant under rotations, as is the actual entropy of a random vector X. Notice that the condition on the joint kernel function requires hyperspherical symmetry.

Properties of the Estimators

Bias of the Information Potential:

Bias\left[ \hat V_2(X) \right] = E\left[ \hat V_2(X) \right] - \int f^2(x)\,dx \approx \frac{\sigma^2}{2}\, E\left[ f''(X) \right], \qquad \text{as } N\to\infty

Variance of the Information Potential:

Var\left[ \hat V_2(X) \right] = E\left[ \hat V_2(X)^2 \right] - \left( E\left[ \hat V_2(X) \right] \right)^2 = \frac{N(N-1)(N-2)}{N^4}\,a + \frac{N(N-1)}{N^4}\,b \approx \frac{a}{N}

with

a = E\left[ K_\sigma(x_i - x_j)\, K_\sigma(x_j - x_l) \right] - E\left[ K_\sigma(x_i - x_j) \right] E\left[ K_\sigma(x_j - x_l) \right], \qquad b = Var\left[ K_\sigma(x_i - x_j) \right]

The AMISE (asymptotic mean integrated square error) is

AMISE\left[ \hat V_2(X) \right] = E\left[ \left( \hat V_2(X) - V_2(X) \right)^2 \right] \approx \left( \frac{\sigma^2}{2}\int f''(x)\,dx \right)^2 + \frac{N-1}{N^2}\,a

This is expected from the kernel density estimation theory: the IP belongs to this class of estimators.

Theorem 2.4: \lim_{N\to\infty} \hat H_\alpha(X) = H_\alpha(\hat X) \ge H_\alpha(X), where \hat X is a random variable with the PDF f_X(\cdot) * \kappa_\sigma(\cdot). The equality (in the inequality portion) occurs if and only if (iff) the kernel size is zero. This result is also valid on the average for the finite-sample case.

Physical Interpretation of the Information Potential Estimator

There is a useful physical interpretation of the IP. Think of the gravitational or the electrostatic field. Assume the samples are particles (information particles) of unit mass that interact with each other with the rules specified by the Parzen kernel used. This analogy is exact!

The information potential field is given by the kernel density estimation

\hat V_2(x_j) \triangleq \frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)

and the Information Potential is the total potential

\hat V_2(X) = \frac{1}{N}\sum_{j=1}^{N} \hat V_2(x_j)

Now if there is a potential there are forces in the space of the samples that can be easily computed as

\frac{\partial}{\partial x_j}\hat V_2(x_j) = \frac{1}{N}\sum_{i=1}^{N} G'_{\sigma\sqrt{2}}(x_j - x_i) = \frac{1}{2N\sigma^2}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)\,(x_i - x_j)

F_2(x_j; x_i) \triangleq \frac{1}{N^2}\, G'_{\sigma\sqrt{2}}(x_j - x_i), \qquad F_2(x_j) \triangleq \frac{\partial}{\partial x_j}\hat V_2(X) = \sum_{i=1}^{N} F_2(x_j; x_i)
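The fields and forces are simple pairwise sums. A sketch for one-dimensional samples (Gaussian kernel, NumPy; names are mine):

```python
import numpy as np

def ip_field_and_forces(x, sigma):
    """Per-sample information potential V2_hat(x_j) and information force F2_hat(x_j)."""
    d = x[:, None] - x[None, :]                                         # d[j, i] = x_j - x_i
    G = np.exp(-d**2 / (4 * sigma**2)) / (2 * sigma * np.sqrt(np.pi))   # G_{sigma*sqrt(2)}
    N = len(x)
    V_j = G.mean(axis=1)                       # V2_hat(x_j) = (1/N) sum_i G(x_j - x_i)
    # F2_hat(x_j) = (1/N^2) sum_i G'(x_j - x_i), with G'(d) = -d/(2 sigma^2) * G(d)
    F_j = (-d / (2 * sigma**2) * G).sum(axis=1) / N**2
    return V_j, F_j

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(50)
    V, F = ip_field_and_forces(x, sigma=0.5)
    print(V[:3], F[:3])    # when maximizing entropy the forces push the samples apart
```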

[Figure: interaction between one information particle placed at the center of a 2-D space and the surrounding samples.]

Another way to look at the interactions among samples is to create two matrices in the space

D = \{ d_{ij} \}, \quad d_{ij} = x_i - x_j \qquad\qquad \varsigma = \{ \hat V_{ij} \}, \quad \hat V_{ij} = G_{\sigma\sqrt{2}}(d_{ij})

D is a matrix of distances, and \varsigma is a matrix of scalars. As we can see, the samples create a metric space given by the kernel. The fields become

\hat V_2(i) = \frac{1}{N}\sum_{j=1}^{N} \hat V_{ij}, \qquad \hat F_2(i) = \frac{1}{N}\sum_{j=1}^{N} \hat V_{ij}\, d_{ij}, \qquad \hat V_2(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \hat V_{ij}

[Figure: example of forces pulling apart samples when the entropy is maximized. The lines show the direction of the force and their size is the intensity of the pull.]

Extension to any \alpha and kernel

The framework of information potential and forces extends to any Parzen kernel, where the shape of the kernel creates the interaction field. Changing \alpha also changes the field. The \alpha information potential field is

\hat V_\alpha(x_j) \triangleq \left( \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x_j - x_i) \right)^{\alpha-1}, \qquad \hat V_\alpha(X) = \frac{1}{N}\sum_{j=1}^{N} \hat V_\alpha(x_j)

and it can be expressed as a function of the IP as

\hat V_\alpha(x_j) = \left( \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x_j - x_i) \right)^{\alpha-2} \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x_j - x_i) = \hat p_X^{\alpha-2}(x_j)\, \hat V_2(x_j)

The \alpha information force is likewise given by

\hat F_\alpha(x_j) \triangleq \frac{\partial}{\partial x_j}\hat V_\alpha(x_j) = (\alpha - 1)\left( \frac{1}{N}\sum_{i=1}^{N} \kappa_\sigma(x_j - x_i) \right)^{\alpha-2} \frac{1}{N}\sum_{i=1}^{N} \kappa'_\sigma(x_j - x_i) = (\alpha - 1)\, \hat p_X^{\alpha-2}(x_j)\, \hat F_2(x_j)

so conceptually all fields can be composed from the IP. \alpha < 2 emphasizes regions of lower density of samples (opposite for \alpha > 2).

Divergence Measures

Renyi proposed a divergence measure from his entropy definition, which is different from the Kullback-Leibler divergence discussed in Ch. 1.

Renyi's divergence:

D_\alpha(f \| g) \triangleq \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty} f(x)\left( \frac{f(x)}{g(x)} \right)^{\alpha-1} dx

with properties

i. D_\alpha(f \| g) \ge 0, \quad \forall f, g, \ \alpha > 0
ii. D_\alpha(f \| g) = 0 iff f(x) = g(x) \ \forall x \in \mathbb{R}
iii. \lim_{\alpha\to 1} D_\alpha(f \| g) = D_{KL}(f \| g)

Approximating the expectation by the empirical mean, with Parzen estimates for the PDFs, yields the estimator

D_\alpha(f \| g) = \frac{1}{\alpha-1}\log E_f\left[ \left( \frac{f(x)}{g(x)} \right)^{\alpha-1} \right]
\approx \frac{1}{\alpha-1}\log\frac{1}{N_f}\sum_{j=1}^{N_f}\left( \frac{\frac{1}{N_f}\sum_{i=1}^{N_f}\kappa_\sigma(x_j^f - x_i^f)}{\frac{1}{N_g}\sum_{i=1}^{N_g}\kappa_\sigma(x_j^f - x_i^g)} \right)^{\alpha-1}
\triangleq \hat D_\alpha(f \| g)
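A plug-in sketch of this divergence estimator, using Gaussian Parzen estimates for both densities (NumPy; the names, kernel size and toy data are mine):

```python
import numpy as np

def parzen_pdf(points, samples, sigma):
    """Gaussian Parzen estimate of the density of `samples`, evaluated at `points`."""
    d = points[:, None] - samples[None, :]
    return np.mean(np.exp(-d**2 / (2 * sigma**2)), axis=1) / (np.sqrt(2 * np.pi) * sigma)

def renyi_divergence(xf, xg, alpha, sigma):
    """Plug-in estimate of D_alpha(f||g) from samples xf ~ f and xg ~ g."""
    f_at_xf = parzen_pdf(xf, xf, sigma)    # f_hat evaluated at the f samples
    g_at_xf = parzen_pdf(xf, xg, sigma)    # g_hat evaluated at the f samples
    return np.log(np.mean((f_at_xf / g_at_xf) ** (alpha - 1))) / (alpha - 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xf = rng.normal(0.0, 1.0, 500)
    xg = rng.normal(0.5, 1.0, 500)
    print(renyi_divergence(xf, xg, alpha=2.0, sigma=0.3))
```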

Renyi's Mutual Information

Renyi's mutual information is also the divergence between the joint and the product of the marginals

I_\alpha(X) \triangleq \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \frac{p_X^{\alpha}(x_1, \dots, x_n)}{\left( \prod_{o=1}^{n} p_{X_o}(x_o) \right)^{\alpha-1}}\, dx_1 \cdots dx_n

and we can come up with a kernel estimate for it if we approximate the E[\cdot] by the empirical mean

\hat I_\alpha(X) = \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left( \frac{\frac{1}{N}\sum_{i=1}^{N}\prod_{o=1}^{n}\kappa_\sigma(x_j^o - x_i^o)}{\prod_{o=1}^{n}\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j^o - x_i^o)} \right)^{\alpha-1}

which also approximates the KL estimator for \alpha \to 1. However, Renyi's divergence and MI are not as general as KL, because Shannon's measure of information is the only one for which the increase in information is equal to the negative of the decrease in uncertainty.

Quadratic Divergence and Mutual Information

Measuring dissimilarity in probability space is a complex issue and there are many other divergence measures that can be used. If we transform the simplex to a hypersphere preserving the Fisher information, the coordinates of each PMF become \sqrt{p_k}. The geodesic distance between PMFs f and g becomes

D_G(f, g) = \cos^{-1}\left( \sum_{k} \sqrt{f_k\, g_k} \right)

This is related to the Bhattacharyya distance (which is Renyi's divergence with \alpha = 1/2)

D_B(f, g) = -\ln\int \sqrt{f(x)\, g(x)}\,dx

Now if we measure the distance between f and g in a linear projection space (the chordal distance) we get the Hellinger distance, which is also related to Renyi's divergence:

D_H(f, g) = \left[ \int\left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 dx \right]^{1/2} = \left[ 2\left( 1 - \int\sqrt{f(x)\, g(x)}\,dx \right) \right]^{1/2}

Since we have an estimator for quadratic Renyi's entropy (the IP), we will substitute \alpha = 2 for \alpha = 1/2 and define two quadratic divergences, called the Euclidean distance and the Cauchy-Schwarz divergence between f and g.

Euclidean distance between PDFs

We will drop the square root and define the distance as

D_{ED}(f, g) = \int\left( f(x) - g(x) \right)^2 dx = \int f^2(x)\,dx - 2\int f(x)g(x)\,dx + \int g^2(x)\,dx

Notice that each of the three terms can be estimated with the IP. The middle term involves samples from the two PDFs and so will be called the Cross Information Potential (CIP); it measures the potential created by one PDF in the locations specified by the samples of the other PDF.

We will define the quadratic mutual information QMI_{ED} between two random variables as

I_{ED}(X_1, X_2) = D_{ED}\left( f_{X_1 X_2}(x_1, x_2),\; f_{X_1}(x_1)\, f_{X_2}(x_2) \right)

Notice that I_{ED} is zero if the two r.v. are independent. The extension to multidimensional random variables is also possible:

I_{ED}(X_1, \dots, X_k) = D_{ED}\left( f_X(x_1, \dots, x_k),\; \prod_{i=1}^{k} f_{X_i}(x_i) \right)

Although a distance, we sometimes refer to D_{ED} as a divergence.

Cauchy-Schwarz divergence

The other divergence is related to the Bhattacharyya distance but can be formally derived from the Cauchy-Schwarz inequality

\int f^2(x)\,dx \int g^2(x)\,dx \ge \left( \int f(x)g(x)\,dx \right)^2

where the equality holds iff f(x) = g(x). The divergence is defined as

D_{CS}(f, g) = -\log\frac{\int f(x)g(x)\,dx}{\sqrt{\int f^2(x)\,dx \int g^2(x)\,dx}}

D_{CS} is always greater than or equal to zero, it is zero for f(x) = g(x), and it is symmetric (but it does not obey the triangular inequality). We will also work with the square of the argument of the log in practice, so we can also write

D_{CS}(f, g) = \log\int f^2(x)\,dx + \log\int g^2(x)\,dx - 2\log\int f(x)g(x)\,dx

We can also define the quadratic mutual information QMI_{CS} as

I_{CS}(X_1, X_2) = D_{CS}\left( f_{X_1 X_2}(x_1, x_2),\; f_{X_1}(x_1)\, f_{X_2}(x_2) \right)

and as before for independent variables it is zero. It can also be extended to multiple variables easily:

I_{CS}(X_1, \dots, X_k) = D_{CS}\left( f_X(x_1, \dots, x_k),\; \prod_{i=1}^{k} f_{X_i}(x_i) \right)
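Both quadratic divergences reduce to combinations of information potentials and the cross information potential introduced above, so a sample-based sketch is short (Gaussian kernel, NumPy; names are mine):

```python
import numpy as np

def cross_ip(x, y, sigma):
    """Cross information potential: (1/(Nx*Ny)) sum_ij G_{sigma*sqrt(2)}(x_i - y_j)."""
    d = x[:, None] - y[None, :]
    return np.mean(np.exp(-d**2 / (4 * sigma**2))) / (2 * sigma * np.sqrt(np.pi))

def euclidean_divergence(x, y, sigma):
    """D_ED(f, g) = V_f - 2*V_c + V_g, each term estimated with the IP / CIP."""
    return cross_ip(x, x, sigma) - 2 * cross_ip(x, y, sigma) + cross_ip(y, y, sigma)

def cauchy_schwarz_divergence(x, y, sigma):
    """D_CS(f, g) = log( V_f * V_g / V_c^2 )."""
    vf, vg, vc = cross_ip(x, x, sigma), cross_ip(y, y, sigma), cross_ip(x, y, sigma)
    return np.log(vf * vg / vc**2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 400)
    y = rng.normal(1.0, 1.0, 400)
    print(euclidean_divergence(x, y, 0.5), cauchy_schwarz_divergence(x, y, 0.5))
```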

Relation between D_CS and Renyi's relative entropy

There is a very interesting interpretation of D_CS in terms of Renyi's entropy. Lutwak defines Renyi's relative entropy as

D_\alpha(f \| g) = \log\frac{\left( \int_{\mathbb{R}} g^{\alpha-1}(x)\, f(x)\,dx \right)^{1/(1-\alpha)} \left( \int_{\mathbb{R}} g^{\alpha}(x)\,dx \right)^{1/\alpha}}{\left( \int_{\mathbb{R}} f^{\alpha}(x)\,dx \right)^{1/(\alpha(1-\alpha))}}

For \alpha = 2 this gives exactly D_CS, so

D_{CS}(X, Y) = -\log\int f(x)g(x)\,dx + \frac{1}{2}\log\int f^2(x)\,dx + \frac{1}{2}\log\int g^2(x)\,dx = H_2(X; Y) - \frac{1}{2}H_2(X) - \frac{1}{2}H_2(Y)

where the first term can be shown to be quadratic Renyi's cross entropy. There is a striking similarity between D_CS and Shannon's mutual information, but notice that here we are dealing with Renyi's quadratic entropy.

Geometric Interpretation of Quadratic Mutual Information

Let us define

V_J = \int\int f_{X_1 X_2}^2(x_1, x_2)\,dx_1\,dx_2
V_M = \int\int \left( f_{X_1}(x_1)\, f_{X_2}(x_2) \right)^2 dx_1\,dx_2
V_c = \int\int f_{X_1 X_2}(x_1, x_2)\, f_{X_1}(x_1)\, f_{X_2}(x_2)\,dx_1\,dx_2

therefore QMI_{ED} and QMI_{CS} can be rewritten as

I_{ED} = V_J - 2 V_c + V_M
I_{CS} = \log V_J - 2\log V_c + \log V_M

The figure illustrates these distances in the simplex.

[Figure: the joint PMF f_{X_1 X_2}(x_1, x_2) and the factorized PMF f_{X_1}(x_1) f_{X_2}(x_2) viewed as two vectors separated by an angle \theta; I_{ED} is the Euclidean distance between them, I_S is the K-L divergence, and I_{CS} = -\log\left( \cos^2\theta \right) with \cos\theta = V_c / \sqrt{V_J V_M}.]

Example

[Table: a 2 x 2 joint PMF P_{X_1 X_2}(i, j) with marginals P_{X_1}(i) and P_{X_2}(j).]

I_{ED}(X_1, X_2) = \sum_{i=1}^{n}\sum_{j=1}^{m}\left( P_{X_1 X_2}(i, j) - P_{X_1}(i)\, P_{X_2}(j) \right)^2

I_{CS}(X_1, X_2) = \log\frac{\sum_{i=1}^{n}\sum_{j=1}^{m} P_{X_1 X_2}^2(i, j)\;\sum_{i=1}^{n}\sum_{j=1}^{m}\left( P_{X_1}(i)\, P_{X_2}(j) \right)^2}{\left( \sum_{i=1}^{n}\sum_{j=1}^{m} P_{X_1 X_2}(i, j)\, P_{X_1}(i)\, P_{X_2}(j) \right)^2}

Fix P_{X_1} = (0.6, 0.4) and vary P(1,1) in [0, 0.6] and P(2,1) in [0, 0.4].

[Figure: surfaces of I_S, I_{ED} and I_{CS} as functions of P_{X_1 X_2}(1,1) and P_{X_1 X_2}(2,1).]
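For the discrete case the two QMIs are direct array expressions. A sketch reproducing one point of the slide's sweep (NumPy; names are mine):

```python
import numpy as np

def qmi_discrete(P):
    """I_ED and I_CS for a discrete joint PMF P[i, j], using the formulas above."""
    P = np.asarray(P, float)
    P1 = P.sum(axis=1)            # marginal of X1
    P2 = P.sum(axis=0)            # marginal of X2
    M = np.outer(P1, P2)          # product of the marginals
    I_ed = np.sum((P - M) ** 2)
    I_cs = np.log(np.sum(P**2) * np.sum(M**2) / np.sum(P * M) ** 2)
    return I_ed, I_cs

if __name__ == "__main__":
    # the slide's setting: P_X1 = (0.6, 0.4); sweep P(1,1) over [0, 0.6] and P(2,1) over [0, 0.4]
    P = np.array([[0.3, 0.3],
                  [0.2, 0.2]])    # one point of the sweep (the independent case)
    print(qmi_discrete(P))        # both QMIs are zero for independent variables
```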

Information Potential and Forces in the Joint Space

The important lesson to be learned from the IP framework and the distance measures defined is that each PDF creates its own potential that interacts with the others. This is an additive process, so the forces are also additive, and it becomes relatively simple.

Euclidean and Cauchy-Schwarz divergences

The three potentials needed are

\hat V_f = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\left( x_i(f) - x_j(f) \right)
\hat V_g = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\left( x_i(g) - x_j(g) \right)
\hat V_c = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\left( x_i(f) - x_j(g) \right)

and the divergences and forces are written as

\hat D_{ED}(f, g) = \hat V_f - 2\hat V_c + \hat V_g, \qquad \hat D_{CS}(f, g) = \log\frac{\hat V_f\, \hat V_g}{\hat V_c^2}

\frac{\partial \hat D_{ED}}{\partial x_i} = \frac{\partial \hat V_f}{\partial x_i} + \frac{\partial \hat V_g}{\partial x_i} - 2\frac{\partial \hat V_c}{\partial x_i}, \qquad \frac{\partial \hat D_{CS}}{\partial x_i} = \frac{1}{\hat V_f}\frac{\partial \hat V_f}{\partial x_i} + \frac{1}{\hat V_g}\frac{\partial \hat V_g}{\partial x_i} - \frac{2}{\hat V_c}\frac{\partial \hat V_c}{\partial x_i}

Quadratic Mutual Information

The QMIs are a bit more detailed because now we have to deal with the joint, the marginals and their product. But everything is still additive. The three PDFs are

\hat f_{X_1 X_2}(x_1, x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\left( x_1 - x_1(i) \right) G_\sigma\left( x_2 - x_2(i) \right)
\hat f_{X_1}(x_1) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\left( x_1 - x_1(i) \right)
\hat f_{X_2}(x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\left( x_2 - x_2(i) \right)

and they create the following fields

\hat V_J = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\left( x_1(i) - x_1(j) \right) G_{\sigma\sqrt{2}}\left( x_2(i) - x_2(j) \right)

\hat V_M = \hat V_1 \hat V_2, \qquad \hat V_k = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\left( x_k(i) - x_k(j) \right), \quad k = 1, 2

\hat V_C = \frac{1}{N}\sum_{j=1}^{N}\prod_{k=1}^{2}\left[ \frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}\left( x_k(j) - x_k(i) \right) \right]

Let us exemplify for \hat V_C:

\hat V_C = \int\int \hat f_{X_1 X_2}(x_1, x_2)\, \hat f_{X_1}(x_1)\, \hat f_{X_2}(x_2)\,dx_1\,dx_2
= \frac{1}{N^3}\sum_{k=1}^{N}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\left( x_1(k) - x_1(i) \right) G_{\sigma\sqrt{2}}\left( x_2(k) - x_2(j) \right)
= \frac{1}{N}\sum_{k=1}^{N}\left[ \frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}\left( x_1(k) - x_1(i) \right) \right]\left[ \frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\left( x_2(k) - x_2(j) \right) \right]

Notice that \hat V_C has complexity O(N^3). Finally we have

\hat I_{ED}(X_1, X_2) = \hat V_J - 2\hat V_C + \hat V_M, \qquad \hat I_{CS}(X_1, X_2) = \log\frac{\hat V_J\, \hat V_M}{\hat V_C^2}
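A sample-based sketch of \hat V_J, \hat V_C, \hat V_M and the two QMIs for a pair of one-dimensional variables (Gaussian kernel, NumPy; names are mine). Once the two N x N Gaussian matrices are stored, \hat V_C can be obtained from their row means in O(N^2) instead of the O(N^3) triple sum:

```python
import numpy as np

def gram(a, b, sigma):
    """Pairwise Gaussian matrix G_{sigma*sqrt(2)}(a_i - b_j)."""
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (4 * sigma**2)) / (2 * sigma * np.sqrt(np.pi))

def qmi_samples(x1, x2, sigma):
    """Return (I_ED, I_CS) = (V_J - 2 V_C + V_M, log(V_J V_M / V_C^2)) from paired samples."""
    K1 = gram(x1, x1, sigma)
    K2 = gram(x2, x2, sigma)
    V_J = np.mean(K1 * K2)                              # joint field
    V_C = np.mean(K1.mean(axis=1) * K2.mean(axis=1))    # cross field via row means
    V_M = K1.mean() * K2.mean()                         # product of the marginal fields
    return V_J - 2 * V_C + V_M, np.log(V_J * V_M / V_C**2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.standard_normal(400)
    x2 = 0.8 * x1 + 0.6 * rng.standard_normal(400)      # a dependent pair
    print(qmi_samples(x1, x2, 0.5))
```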

The interactions are always based on the marginals, but there are three levels: the level of the individual marginal samples (joint space), the level of the marginal sample and marginal fields (the CIP), and the product of the two marginal fields. This field can be generalized to any number of variables

\hat I_{ED}(X_1, \dots, X_K) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat V_k(i, j) - \frac{2}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\hat V_k(i) + \prod_{k=1}^{K}\hat V_k

\hat I_{CS}(X_1, \dots, X_K) = \log\frac{\left( \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat V_k(i, j) \right)\prod_{k=1}^{K}\hat V_k}{\left( \frac{1}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\hat V_k(i) \right)^2}

where \hat V_k(i, j) = G_{\sigma\sqrt{2}}\left( x_k(i) - x_k(j) \right), \hat V_k(i) = \frac{1}{N}\sum_{j=1}^{N}\hat V_k(i, j) and \hat V_k = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\hat V_k(i, j).

The information forces for each field are also easily derived:

F_{ED}(i) = \frac{\partial \hat I_{ED}}{\partial x_k(i)} = \frac{\partial \hat V_J}{\partial x_k(i)} - 2\frac{\partial \hat V_C}{\partial x_k(i)} + \frac{\partial \hat V_M}{\partial x_k(i)}, \qquad
F_{CS}(i) = \frac{\partial \hat I_{CS}}{\partial x_k(i)} = \frac{1}{\hat V_J}\frac{\partial \hat V_J}{\partial x_k(i)} - \frac{2}{\hat V_C}\frac{\partial \hat V_C}{\partial x_k(i)} + \frac{1}{\hat V_M}\frac{\partial \hat V_M}{\partial x_k(i)}

These expressions are going to be very useful when adapting systems with divergences or quadratic mutual information.

Fast Information and Cross Information Potential Calculations

One of the problems with this IP estimator methodology is the computational complexity, which is O(N^2) or O(N^3). There are two ways to get approximations to the IP and CIP with arbitrary accuracy: the Fast Gauss Transform and the Incomplete Cholesky decomposition. They both transform the complexity to O(NM) where M << N.

Fast Gauss Transform

The FGT takes advantage of the fast decay of the Gaussian function and can efficiently compute weighted sums of scalar Gaussians

S(y_i) = \sum_{j=1}^{N} w_j\, e^{-(y_i - y_j)^2 / (4\sigma^2)}, \qquad i = 1, \dots, M

The savings come from the shifting property of the Gaussian function, which reads

e^{-(y_i - y_j)^2 / (4\sigma^2)} = e^{-\left( (y_i - y_C) - (y_j - y_C) \right)^2 / (4\sigma^2)} = \sum_{n=0}^{\infty}\frac{1}{n!}\left( \frac{y_j - y_C}{2\sigma} \right)^n h_n\!\left( \frac{y_i - y_C}{2\sigma} \right)

and from the efficient approximation of the Hermite expansion with a small order p, where

h_n(y) = (-1)^n \frac{d^n}{dy^n}\left( e^{-y^2} \right)

The shifting property is useful because there is no need to evaluate every Gaussian at every point. Instead a p-term sum is computed around a small number y_C of cluster centers with O(Np) computation. These sums are then shifted to the desired locations y_i and evaluated in another O(Mp) operations. Normally the centers are found using the furthest point algorithm, which is efficient.

The information potential calculation (M = N) can be immediately given as

\hat V_2(y) \approx \frac{1}{2 N^2 \sigma\sqrt{\pi}}\sum_{j=1}^{N}\sum_{b=1}^{B}\sum_{n=0}^{p-1}\frac{1}{n!}\, C_n(b)\, h_n\!\left( \frac{y_j - y_{C_b}}{2\sigma} \right), \qquad C_n(b) = \sum_{y_i \in B_b}\left( \frac{y_i - y_{C_b}}{2\sigma} \right)^n

which requires O(NpB) calculations. p is normally 4 or 5 (independent of N), and B, the number of clusters, is also relatively small (~10), but with a weak dependence on N.

The problem with this algorithm is if the data is multidimensional, because the complexity p changes to p^D where D is the dimension. In order to cope with high dimensions, a vector Taylor series has been proposed to approximate the high dimensional Gaussian as

e^{-\left\| y_j - y_i \right\|^2 / (4\sigma^2)} = e^{-\left\| y_j - y_{C_B} \right\|^2 / (4\sigma^2)}\; e^{-\left\| y_i - y_{C_B} \right\|^2 / (4\sigma^2)}\; e^{\,(y_j - y_{C_B})\cdot(y_i - y_{C_B}) / (2\sigma^2)}

and the cross term is approximated by a Taylor expansion as

e^{\,(y_j - y_{C_B})\cdot(y_i - y_{C_B}) / (2\sigma^2)} = \sum_{\alpha \ge 0}\frac{2^{|\alpha|}}{\alpha!}\left( \frac{y_j - y_{C_B}}{2\sigma} \right)^{\alpha}\left( \frac{y_i - y_{C_B}}{2\sigma} \right)^{\alpha} + \varepsilon

where \alpha is a multi-index, \alpha! = \alpha_1!\cdots\alpha_d! and |\alpha| = \alpha_1 + \cdots + \alpha_d.

Now the IP for multidimensions becomes

\hat V_2(y) \approx \frac{1}{N^2 (4\pi\sigma^2)^{d/2}}\sum_{j=1}^{N}\sum_{B}\exp\!\left( -\frac{\left\| y_j - y_{C_B} \right\|^2}{4\sigma^2} \right)\sum_{\alpha \ge 0} C_\alpha(B)\left( \frac{y_j - y_{C_B}}{2\sigma} \right)^{\alpha}

C_\alpha(B) = \frac{2^{|\alpha|}}{\alpha!}\sum_{y_i \in B}\exp\!\left( -\frac{\left\| y_i - y_{C_B} \right\|^2}{4\sigma^2} \right)\left( \frac{y_i - y_{C_B}}{2\sigma} \right)^{\alpha}

For a D dimensional data set the calculation is O(N B r_{p,D}) instead of O(N B p^D), with

r_{p,D} = \binom{p + D}{D} = \frac{(p + D)!}{D!\, p!}

but normally p must be larger than before for the same precision.
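To make the one-dimensional Hermite-expansion version of the previous page concrete, here is a rough sketch (NumPy; names are mine). It uses simple equal-width bins as clusters instead of the furthest-point algorithm mentioned in the slides, and compares the approximation against the direct O(N^2) information potential:

```python
import numpy as np

def hermite_functions(t, p):
    """h_n(t) = (-1)^n d^n/dt^n exp(-t^2) for n = 0..p-1, via the standard recurrence."""
    h = np.empty((p,) + t.shape)
    h[0] = np.exp(-t**2)
    if p > 1:
        h[1] = 2 * t * h[0]
    for n in range(1, p - 1):
        h[n + 1] = 2 * t * h[n] - 2 * n * h[n - 1]
    return h

def ip_fgt(x, sigma, p=8, n_clusters=12):
    """FGT-style approximation of the information potential (1-D, Gaussian kernel)."""
    x = np.asarray(x, float)
    N = len(x)
    edges = np.linspace(x.min(), x.max(), n_clusters + 1)          # equal-width clusters
    labels = np.clip(np.digitize(x, edges) - 1, 0, n_clusters - 1)
    fact = np.cumprod(np.concatenate(([1.0], np.arange(1, p))))    # n! for n = 0..p-1
    total = 0.0
    for b in range(n_clusters):
        members = x[labels == b]
        if members.size == 0:
            continue
        c = members.mean()                          # cluster center y_C(b)
        s = (members - c) / (2 * sigma)
        C = np.array([np.sum(s**n) for n in range(p)])   # moments C_n(b)
        t = (x - c) / (2 * sigma)
        h = hermite_functions(t, p)                      # h_n at every target y_j
        total += np.sum((C / fact)[:, None] * h)
    return total / (2 * N**2 * sigma * np.sqrt(np.pi))

def ip_direct(x, sigma):
    """Direct O(N^2) information potential for comparison."""
    d = x[:, None] - x[None, :]
    return np.mean(np.exp(-d**2 / (4 * sigma**2))) / (2 * sigma * np.sqrt(np.pi))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(2000)
    print(ip_direct(x, 0.5), ip_fgt(x, 0.5))   # the two values should agree closely
```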

Incomplete Cholesky Decomposition

It turns out that the matrix created by the pairwise evaluation of the Gaussian is full rank, but its eigenspectrum decays very fast. Hence we can take advantage of this fact. An N x N symmetric positive definite matrix can be factored as K = G G^T, where G is a lower triangular matrix with positive diagonal entries. When the eigenspectrum falls fast we can approximate K by an N x D lower triangular matrix \tilde G such that

\left\| K - \tilde G \tilde G^T \right\| < \varepsilon

There are ways of computing \tilde G effectively, and the complexity of the procedure becomes O(ND). The IP can be computed as

\hat V_2(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - x_j) = \frac{1}{N^2}\,\mathbf{1}_N^T K_{XX}\,\mathbf{1}_N = \frac{1}{N^2}\left\| \tilde G_{XX}^T \mathbf{1}_N \right\|^2

and, writing K_{XX} \approx \tilde G_{XX}\tilde G_{XX}^T and K_{YY} \approx \tilde G_{YY}\tilde G_{YY}^T for the two marginal Gram matrices, the QMIs become

\hat I_{ED} = \frac{1}{N^2}\mathbf{1}^T\left( \tilde G_{XX}\tilde G_{XX}^T \circ \tilde G_{YY}\tilde G_{YY}^T \right)\mathbf{1} - \frac{2}{N^3}\mathbf{1}^T \tilde G_{XX}\tilde G_{XX}^T \tilde G_{YY}\tilde G_{YY}^T \mathbf{1} + \frac{1}{N^4}\left( \mathbf{1}^T \tilde G_{XX}\tilde G_{XX}^T \mathbf{1} \right)\left( \mathbf{1}^T \tilde G_{YY}\tilde G_{YY}^T \mathbf{1} \right)

\hat I_{CS} = \log\frac{\left( \mathbf{1}^T\left( \tilde G_{XX}\tilde G_{XX}^T \circ \tilde G_{YY}\tilde G_{YY}^T \right)\mathbf{1} \right)\left( \mathbf{1}^T \tilde G_{XX}\tilde G_{XX}^T \mathbf{1} \right)\left( \mathbf{1}^T \tilde G_{YY}\tilde G_{YY}^T \mathbf{1} \right)}{\left( \mathbf{1}^T \tilde G_{XX}\tilde G_{XX}^T \tilde G_{YY}\tilde G_{YY}^T \mathbf{1} \right)^2}

where \circ denotes the Hadamard (elementwise) product.

However the CIP is not a positive definite matrix, so we need to extend K_{XX} to create a positive definite matrix

K_{ZZ} = \begin{pmatrix} K_{XX} & K_{XY} \\ K_{XY}^T & K_{YY} \end{pmatrix}

where K_{XY} denotes the Gram matrix for the CIP. With the selector vectors

e_1 = [1, \dots, 1, 0, \dots, 0]^T \quad (N \text{ ones, } N \text{ zeros}), \qquad e_2 = [0, \dots, 0, 1, \dots, 1]^T

the calculation of the CIP becomes

\hat V(X; Y) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - y_j) = \frac{1}{N^2}\, e_1^T K_{ZZ}\, e_2 = \frac{1}{N^2}\left( \tilde G_{ZZ}^T e_1 \right)^T\left( \tilde G_{ZZ}^T e_2 \right)

with complexity O(ND^2). The divergence measures can be computed efficiently as

\hat D_{ED} = \frac{1}{N^2}\, e_1^T \tilde G_{ZZ}\tilde G_{ZZ}^T e_1 - \frac{2}{N^2}\, e_1^T \tilde G_{ZZ}\tilde G_{ZZ}^T e_2 + \frac{1}{N^2}\, e_2^T \tilde G_{ZZ}\tilde G_{ZZ}^T e_2

\hat D_{CS} = \log\frac{\left( e_1^T \tilde G_{ZZ}\tilde G_{ZZ}^T e_1 \right)\left( e_2^T \tilde G_{ZZ}\tilde G_{ZZ}^T e_2 \right)}{\left( e_1^T \tilde G_{ZZ}\tilde G_{ZZ}^T e_2 \right)^2}