
Entropy, Relative Entropy

and Mutual Information

Prof. Ja-Ling Wu

Department of Computer Science

and Information Engineering

National Taiwan University


Information Theory 2

Definition: The Entropy H(X) of a discrete

random variable X is defined by

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)

(log base 2: H(X) is measured in bits.)

Convention: 0 \log 0 = 0 (since x \log x \to 0 as x \to 0); adding terms of zero probability does not change the entropy.


Information Theory 3

Note that entropy is a function of the distribution

of X. It does not depend on the actual values

taken by the r.v. X, but only on the probabilities.

Remark: If X ~ p(x), the expected value of the r.v. g(X) is written

E_p g(X) = \sum_{x \in \mathcal{X}} g(x) p(x)    (expectation value)

The entropy of X is the expected value of \log \frac{1}{p(X)}:

H(X) = E_p \log \frac{1}{p(X)}

where \log \frac{1}{p(x)} is the self-information of the outcome x.


Information Theory 4

Lemma 1.1: H(X) \ge 0

Lemma 1.2: H_b(X) = (\log_b a) H_a(X)

Ex:

X = 1 with probability P,  X = 0 with probability 1 - P.

H(X) = -P \log P - (1-P) \log (1-P) \overset{def}{=} H_2(P)

[Figure: the binary entropy function H_2(P) plotted against P, for 0 \le P \le 1.]

1) H(X) = 1 bit when P = 1/2

2) H(X) is a concave function of P

3) H(X) = 0 if P = 0 or 1

4) max H(X) occurs when P = 1/2
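As a quick sanity check of these four observations, the sketch below evaluates H_2(P) numerically. It is only an illustration (plain Python, base-2 logarithms), not part of the original slides.

```python
import math

def binary_entropy(p: float) -> float:
    """H2(p) = -p log2 p - (1-p) log2 (1-p), with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit at P = 1/2
print(binary_entropy(0.0))   # 0.0 at P = 0 (and likewise at P = 1)
# The maximum over a fine grid is attained at P = 1/2.
print(max(range(1, 100), key=lambda k: binary_entropy(k / 100)))  # 50
```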


Information Theory 5

Joint Entropy and Conditional Entropy

Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) = -E \log p(X, Y)

Definition: The conditional entropy H(Y|X) is defined as

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)
       = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)
       = -E_{p(x,y)} \log p(Y|X)


Information Theory 6

Theorem 1.1 (Chain Rule):

H(X, Y) = H(X) + H(Y|X)

pf:

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)
        = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log [\, p(x) p(y|x) \,]
        = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)
        = -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)
        = H(X) + H(Y|X)

or equivalently, we can write

\log p(X, Y) = \log p(X) + \log p(Y|X)
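A minimal numeric check of the chain rule, using a small joint distribution p(x, y) made up for illustration (the values are assumptions, not from the slides):

```python
import math

# Hypothetical joint distribution p(x, y) over X = {0, 1}, Y = {0, 1, 2}.
p_xy = {(0, 0): 0.30, (0, 1): 0.10, (0, 2): 0.10,
        (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.20}

def H(dist):
    """Entropy in bits of a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p

# H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

print(H(p_xy))                 # H(X, Y)
print(H(p_x) + H_Y_given_X)    # H(X) + H(Y|X)  -- matches H(X, Y)
```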


Information Theory 7

Corollary:

H(X, Y|Z) = H(X|Z) + H(Y|X,Z)

Remark:

(i) H(Y|X) \ne H(X|Y) in general

(ii) H(X) – H(X|Y) = H(Y) – H(Y|X)


Information Theory 8

Relative Entropy and Mutual Information

The entropy of a random variable is a measure of the uncertainty of the random variable; it is a measure of the amount of information required on the average to describe the random variable.

The relative entropy is a measure of the distance between two distributions. In statistics, it arises as an expected logarithm of the likelihood ratio. The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p.


Information Theory 9

Ex: If we knew the true distribution of the r.v.,

then we could construct a code with average

description length H(p). If instead, we used the

code for a distribution q, we would need

H(p)+D(p||q) bits on the average to describe

the r.v.


Information Theory 10

Definition:

The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

D(p||q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
        = E_p \log \frac{p(X)}{q(X)}
        = E_p \log \frac{1}{q(X)} - E_p \log \frac{1}{p(X)}


Information Theory 11

Definition:

Consider two r.v.’s X and Y with a joint

probability mass function p(x,y) and marginal

probability mass functions p(x) and p(y). The

mutual information I(X;Y) is the relative

entropy between the joint distribution and the

product distribution p(x)p(y), i.e.,

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = D( p(x, y) \,||\, p(x) p(y) )
        = E_{p(x,y)} \log \frac{p(X, Y)}{p(X) p(Y)}


Information Theory 12

Ex: Let X = {0, 1} and consider two distributions p and

q on X. Let p(0)=1-r, p(1)=r, and let q(0)=1-s, q(1)=s.

Then

D(p||q) = p(0) \log \frac{p(0)}{q(0)} + p(1) \log \frac{p(1)}{q(1)} = (1-r) \log \frac{1-r}{1-s} + r \log \frac{r}{s}

and

D(q||p) = q(0) \log \frac{q(0)}{p(0)} + q(1) \log \frac{q(1)}{p(1)} = (1-s) \log \frac{1-s}{1-r} + s \log \frac{s}{r}

If r = s, then D(p||q) = D(q||p) = 0, while, in general,

D(p||q) \ne D(q||p)
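A short numeric illustration of this asymmetry, assuming the particular values r = 1/2 and s = 1/4 (chosen here only for the example):

```python
import math

def D(p, q):
    """Relative entropy D(p||q) in bits between two pmfs given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

r, s = 0.5, 0.25          # illustrative values, not from the slides
p = [1 - r, r]            # p(0), p(1)
q = [1 - s, s]            # q(0), q(1)

print(D(p, q))            # ~0.2075 bits
print(D(q, p))            # ~0.1887 bits  -- D(p||q) != D(q||p)
```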


Information Theory 13

Relationship between Entropy and Mutual Information

Rewrite I(X; Y) as

I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = \sum_{x,y} p(x, y) \log \frac{p(x|y)}{p(x)}
        = -\sum_{x,y} p(x, y) \log p(x) + \sum_{x,y} p(x, y) \log p(x|y)
        = -\sum_{x} p(x) \log p(x) - \left( -\sum_{x,y} p(x, y) \log p(x|y) \right)
        = H(X) - H(X|Y)


Information Theory 14

Thus the mutual information I(X;Y) is the reduction in the

uncertainty of X due to the knowledge of Y.

By symmetry, it follows that

I(X;Y) = H(Y) – H(Y|X)

X says as much about Y as Y says about X

Since H(X,Y) = H(X) + H(Y|X)

I(X;Y) = H(X) + H(Y) – H(X,Y)

I(X;X) = H(X) – H(X|X) = H(X)

The mutual information of a r.v. with itself is the entropy of the r.v.; for this reason, entropy is sometimes called self-information.
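These identities can be checked numerically; the joint distribution below is an arbitrary illustrative choice, not taken from the slides:

```python
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # illustrative joint pmf

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = {x: sum(p for (a, b), p in p_xy.items() if a == x) for x in {a for a, _ in p_xy}}
p_y = {y: sum(p for (a, b), p in p_xy.items() if b == y) for y in {b for _, b in p_xy}}

# I(X;Y) from the definition, and from H(X) + H(Y) - H(X, Y).
I_def = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)
I_from_entropies = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())

print(I_def, I_from_entropies)   # both ~0.278 bits
```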


Information Theory 15

Theorem: (Mutual information and entropy):

i. I(X;Y) = H(X) – H(X|Y)

= H(Y) – H(Y|X)

= H(X) + H(Y) – H(X,Y)

ii. I(X;Y) = I(Y;X)

iii. I(X;X) = H(X)

[Figure: Venn diagram showing the relationship between H(X), H(Y), H(X, Y), H(X|Y), H(Y|X) and I(X; Y).]


Information Theory 16

Chain Rules for Entropy, Relative Entropy

and Mutual Information

Theorem: (Chain rule for entropy)

Let X1, X2, …, Xn, be drawn according to

P(x1, x2, …, xn).

Then

H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \,|\, X_{i-1}, \ldots, X_1)


Information Theory 17

Proof:

(1)

H(X_1, X_2) = H(X_1) + H(X_2 | X_1)

H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1)
                 = H(X_1) + H(X_2 | X_1) + H(X_3 | X_2, X_1)

\vdots

H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2 | X_1) + \cdots + H(X_n | X_{n-1}, \ldots, X_1)
                         = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1)


Information Theory 18

(2) We write

p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i | x_{i-1}, \ldots, x_1)

then

H(X_1, X_2, \ldots, X_n)
  = -\sum_{x_1, x_2, \ldots, x_n} p(x_1, x_2, \ldots, x_n) \log p(x_1, x_2, \ldots, x_n)
  = -\sum_{x_1, x_2, \ldots, x_n} p(x_1, x_2, \ldots, x_n) \log \prod_{i=1}^{n} p(x_i | x_{i-1}, \ldots, x_1)
  = -\sum_{x_1, x_2, \ldots, x_n} \sum_{i=1}^{n} p(x_1, x_2, \ldots, x_n) \log p(x_i | x_{i-1}, \ldots, x_1)
  = -\sum_{i=1}^{n} \sum_{x_1, x_2, \ldots, x_i} p(x_1, x_2, \ldots, x_i) \log p(x_i | x_{i-1}, \ldots, x_1)
  = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1)


Information Theory 19

Definition:

The conditional mutual information of rv’s. X

and Y given Z is defined by

I(X; Y | Z) = H(X|Z) - H(X | Y, Z)
            = E_{p(x,y,z)} \log \frac{p(X, Y | Z)}{p(X|Z)\, p(Y|Z)}


Information Theory 20

Theorem: (chain rule for mutual-information)

I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \,|\, X_{i-1}, \ldots, X_1)

proof:

I(X_1, X_2, \ldots, X_n; Y) = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n | Y)
  = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1) - \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1, Y)
  = \sum_{i=1}^{n} I(X_i; Y | X_1, X_2, \ldots, X_{i-1})


Information Theory 21

Definition: The conditional relative entropy D(p(y|x) || q (y|x)) is

the average of the relative entropies between the

conditional probability mass functions p(y|x) and q(y|x)

averaged over the probability mass function p(x).

D( p(y|x) \,||\, q(y|x) ) = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)} = E_{p(x,y)} \log \frac{p(Y|X)}{q(Y|X)}

Theorem: (Chain rule for relative entropy)

D( p(x, y) \,||\, q(x, y) ) = D( p(x) \,||\, q(x) ) + D( p(y|x) \,||\, q(y|x) )


Information Theory 22

Jensen’s Inequality and Its Consequences

Definition: A function f is said to be convex over an interval (a, b) if for every x_1, x_2 \in (a, b) and 0 \le \lambda \le 1,

f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)

A function f is said to be strictly convex if equality holds only if \lambda = 0 or \lambda = 1.

Definition: A function f is concave if -f is convex.

Ex: convex functions: x^2, |x|, e^x, x \log x (for x \ge 0)

concave functions: \log x, x^{1/2} for x \ge 0

both convex and concave: ax+b; linear functions


Information Theory 23

Theorem:

If the function f has a second derivative which

is non-negative (positive) everywhere, then the

function is convex (strictly convex).

EX = \sum_{x \in \mathcal{X}} x\, p(x)      (discrete case)

EX = \int x\, p(x)\, dx                      (continuous case)


Information Theory 24

Theorem : (Jensen’s inequality):

If f(x) is a convex function and X is a random variable, then Ef(X) \ge f(EX).

Proof: For a two mass point distribution, the inequality becomes

p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2),   p_1 + p_2 = 1

which follows directly from the definition of convex functions.

Suppose the theorem is true for distributions with K-1 mass points.

Then writing P’i=Pi/(1-PK) for i = 1, 2, …, K-1, we have

\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i)
  \ge p_k f(x_k) + (1 - p_k) f\!\left( \sum_{i=1}^{k-1} p_i' x_i \right)
  \ge f\!\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \right)
  = f\!\left( \sum_{i=1}^{k} p_i x_i \right)

(Mathematical Induction)

The proof can be extended to continuous distributions by continuity arguments.


Information Theory 25

Theorem: (Information inequality):

Let p(x), q(x) xX, be two probability mass functions. Then

D(p||q) \ge 0

with equality iff p(x)=q(x) for all x.

Proof: Let A={x:p(x)>0} be the support set of p(x). Then

-D(p||q) = -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}
         = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)}
         \le \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)}      (Jensen's inequality; \log t is concave)
         = \log \sum_{x \in A} q(x)
         \le \log \sum_{x \in \mathcal{X}} q(x)
         = \log 1 = 0


Information Theory 26

Corollary: (Non-negativity of mutual information):

For any two rv's. X, Y,

I(X; Y) \ge 0

with equality iff X and Y are independent.

Proof:

I(X; Y) = D( p(x, y) \,||\, p(x) p(y) ) \ge 0, with equality iff p(x, y) = p(x) p(y), i.e., X and Y are independent.

Corollary:

D( p(y|x) \,||\, q(y|x) ) \ge 0

with equality iff p(y|x) = q(y|x) for all x and y with p(x) > 0.

Corollary:

I(X; Y | Z) \ge 0

with equality iff X and Y are conditionally independent given Z.


Information Theory 27

Theorem:

H(X) \le \log|\mathcal{X}|, where |\mathcal{X}| denotes the number of

elements in the range of X, with equality iff X has a

uniform distribution over X.

Proof:

Let u(x)=1/|X| be the uniform probability mass function

over X, and let p(x) be the probability mass function for

X. Then

D(p || u) = \sum p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X)

Hence, by the non-negativity of relative entropy,

0 \le D(p || u) = \log |\mathcal{X}| - H(X)


Information Theory 28

Theorem: (conditioning reduces entropy):

H(X|Y) \le H(X)

with equality iff X and Y are independent.

Proof: 0 \le I(X; Y) = H(X) – H(X|Y)

Note that this is true only on the average; specifically, H(X|Y=y) may be greater than, less than, or equal to H(X), but on the average H(X|Y) = \sum_y p(y) H(X|Y=y) \le H(X).


Information Theory 29

Ex: Let (X, Y) have the following joint distribution p(x, y):

            X = 1    X = 2
   Y = 1      0       3/4
   Y = 2     1/8      1/8

Then, H(X) = H(1/8, 7/8) = 0.544 bits

H(X|Y=1) = 0 bits

H(X|Y=2) = 1 bit > H(X)

However, H(X|Y) = 3/4 H(X|Y=1) + 1/4 H(X|Y=2) = 0.25 bits < H(X)
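A small script reproducing the numbers in this example (only a check of the arithmetic above):

```python
import math

# Joint distribution from the example; keys are (x, y).
p = {(1, 1): 0.0, (2, 1): 3/4, (1, 2): 1/8, (2, 2): 1/8}

def H(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

p_x = {x: sum(q for (a, y), q in p.items() if a == x) for x in (1, 2)}
p_y = {y: sum(q for (x, b), q in p.items() if b == y) for y in (1, 2)}

H_X = H(p_x.values())                              # H(1/8, 7/8)
H_X_given_Y = sum(p_y[y] * H([p[(x, y)] / p_y[y] for x in (1, 2)]) for y in (1, 2))

print(round(H_X, 3))          # 0.544 bits
print(H_X_given_Y)            # 0.25 bits  (= 3/4 * 0 + 1/4 * 1)
```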


Information Theory 30

Theorem: (Independence bound on entropy):

Let X1, X2, …,Xn be drawn according to p(x1, x2, …,xn).

Then

H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i)

with equality iff the X_i are independent.

Proof: By the chain rule for entropies,

H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1) \le \sum_{i=1}^{n} H(X_i)

with equality iff the X_i's are independent.


Information Theory 31

The LOG SUM INEQUALITY AND ITS

APPLICATIONS

Theorem: (Log sum inequality)

For non-negative numbers a_1, a_2, …, a_n and b_1, b_2, …, b_n,

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}

with equality iff a_i / b_i = constant.

(Some conventions: 0 \log 0 = 0, a \log \frac{a}{0} = \infty if a > 0, and 0 \log \frac{0}{0} = 0.)
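A quick numeric illustration of the log sum inequality with arbitrary positive numbers (the values are chosen only for demonstration):

```python
import math

a = [0.2, 0.5, 1.3]   # arbitrary non-negative numbers (illustrative)
b = [0.4, 0.1, 2.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))

print(lhs >= rhs, lhs, rhs)   # True  -- the log sum inequality holds
```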


Information Theory 32

Proof: Assume w.l.o.g. that a_i > 0 and b_i > 0. The function f(t) = t \log t is strictly convex, since f''(t) = \frac{1}{t} \log e > 0 for all positive t. Hence by Jensen's inequality, we have

\sum_i \alpha_i f(t_i) \ge f\!\left( \sum_i \alpha_i t_i \right)   for \alpha_i \ge 0, \sum_i \alpha_i = 1.

Setting \alpha_i = \frac{b_i}{\sum_{j=1}^{n} b_j} and t_i = \frac{a_i}{b_i}, we obtain

\sum_i \frac{a_i}{\sum_j b_j} \log \frac{a_i}{b_i} \ge \left( \sum_i \frac{a_i}{\sum_j b_j} \right) \log \left( \sum_i \frac{a_i}{\sum_j b_j} \right)

which is the log sum inequality.


Information Theory 33

Reproving the theorem that D(p||q) \ge 0, with equality iff p(x) = q(x):

D(p||q) = \sum p(x) \log \frac{p(x)}{q(x)}
        \ge \left( \sum p(x) \right) \log \frac{\sum p(x)}{\sum q(x)}     (from the log-sum inequality)
        = 1 \cdot \log \frac{1}{1} = 0

with equality iff p(x)/q(x) = c. Since both p and q are probability mass functions, c = 1, i.e., p(x) = q(x) for all x.


Information Theory 34

Theorem:

D(p||q) is convex in the pair (p,q), i.e., if (p1, q1) and

(p2, q2) are two pairs of probability mass functions,

then

D( \lambda p_1 + (1-\lambda) p_2 \,||\, \lambda q_1 + (1-\lambda) q_2 ) \le \lambda D(p_1 || q_1) + (1-\lambda) D(p_2 || q_2)

for all 0 \le \lambda \le 1.

Proof:

D( \lambda p_1 + (1-\lambda) p_2 \,||\, \lambda q_1 + (1-\lambda) q_2 )
  = \sum_x [\lambda p_1(x) + (1-\lambda) p_2(x)] \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)}

Apply the log-sum inequality to each term: let a_1 = \lambda p_1(x), a_2 = (1-\lambda) p_2(x), b_1 = \lambda q_1(x), b_2 = (1-\lambda) q_2(x). Then

(a_1 + a_2) \log \frac{a_1 + a_2}{b_1 + b_2} \le a_1 \log \frac{a_1}{b_1} + a_2 \log \frac{a_2}{b_2}

i.e.,

[\lambda p_1(x) + (1-\lambda) p_2(x)] \log \frac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)}
  \le \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}

Summing over x gives

D( \lambda p_1 + (1-\lambda) p_2 \,||\, \lambda q_1 + (1-\lambda) q_2 ) \le \lambda D(p_1 || q_1) + (1-\lambda) D(p_2 || q_2)


Information Theory 35

Theorem: (concavity of entropy):

H(p) is a concave function of p.

That is: H(\lambda p_1 + (1-\lambda) p_2) \ge \lambda H(p_1) + (1-\lambda) H(p_2)

Proof:

H(p)=log|X| – D(p||u)

where u is the uniform distribution on |X|

outcomes. The concavity of H then follows

directly from the convexity of D.


Information Theory 36

Theorem: Let (X,Y)~p(x,y) = p(x)p(y|x).

The mutual information I(X;Y) is

(i) a concave function of p(x) for fixed p(y|x)

(ii) a convex function of p(y|x) for fixed p(x).

Proof:

(1) I(X;Y) = H(Y) - H(Y|X) = H(Y) - \sum_x p(x) H(Y|X=x) … (*)

If p(y|x) is fixed, then p(y) is a linear function of p(x), since p(y) = \sum_x p(x, y) = \sum_x p(x) p(y|x). Hence H(Y), which is a concave function of p(y), is a concave function of p(x). The second term of (*) is a linear function of p(x). Hence the difference is a concave function of p(x).


Information Theory 37

(2) We fix p(x) and consider two different conditional distributions p_1(y|x) and p_2(y|x). The corresponding joint distributions are p_1(x,y) = p(x) p_1(y|x) and p_2(x,y) = p(x) p_2(y|x), and their respective marginals are p(x), p_1(y) and p(x), p_2(y).

Consider a conditional distribution

p_\lambda(y|x) = \lambda p_1(y|x) + (1-\lambda) p_2(y|x)

that is a mixture of p_1(y|x) and p_2(y|x). The corresponding joint distribution is also a mixture of the corresponding joint distributions,

p_\lambda(x,y) = \lambda p_1(x,y) + (1-\lambda) p_2(x,y)

and the distribution of Y is also a mixture, p_\lambda(y) = \lambda p_1(y) + (1-\lambda) p_2(y). Hence if we let q_\lambda(x,y) = p(x) p_\lambda(y) be the product of the marginal distributions, then q_\lambda(x,y) = \lambda q_1(x,y) + (1-\lambda) q_2(x,y).

Now I(X;Y) = D(p_\lambda || q_\lambda), which is convex in the pair (p, q). When p(x) is fixed, p_\lambda(x,y) is linear in p_i(y|x) and q_\lambda(x,y) is also linear in p_i(y|x). Therefore the mutual information is a convex function of the conditional distribution; its convexity is the same as that of D(p||q) w.r.t. p_i(y|x) when p(x) is fixed.


Information Theory 38

Data processing inequality:

No clever manipulation of the data can improve the inferences that can be made from the data

Definition: Rv's. X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. That is, if X → Y → Z form a Markov chain, then

(i) p(x,y,z) = p(x) p(y|x) p(z|y)

(ii) p(x,z|y) = p(x|y) p(z|y) : X and Z are conditionally independent given Y

X → Y → Z implies that Z → Y → X

If Z = f(Y), then X → Y → Z


Information Theory 39

Theorem: (Data processing inequality)

if X → Y → Z, then I(X;Y) \ge I(X;Z)

No processing of Y, deterministic or random, can

increase the information that Y contains about X.

Proof:

I(X;Y,Z) = I(X;Z) + I(X;Y|Z) : chain rule

= I(X;Y) + I(X;Z|Y) : chain rule

Since X and Z are conditionally independent given Y, we have I(X;Z|Y) = 0. Since I(X;Y|Z) \ge 0, we have I(X;Y) \ge I(X;Z), with equality iff I(X;Y|Z) = 0, i.e., X → Z → Y forms a Markov chain. Similarly, one can prove I(Y;Z) \ge I(X;Z).


Information Theory 40

Corollary: If X → Y → Z forms a Markov chain and if Z = g(Y), we have I(X;Y) \ge I(X;g(Y))

: functions of the data Y cannot increase the information about X.

Corollary: If X → Y → Z, then I(X;Y|Z) \le I(X;Y)

Proof: I(X;Y,Z) = I(X;Z) + I(X;Y|Z)

= I(X;Y) + I(X;Z|Y)

By Markovity, I(X;Z|Y) = 0

and I(X;Z) \ge 0  ⇒  I(X;Y|Z) \le I(X;Y)

The dependence of X and Y is decreased (or remains

unchanged) by the observation of a “downstream” r.v. Z.


Information Theory 41

Note that it is possible that I(X;Y|Z)>I(X;Y)

when X,Y and Z do not form a Markov chain.

Ex: Let X and Y be independent fair binary rv’s, and

let Z=X+Y. Then I(X;Y)=0, but

I(X;Y|Z) =H(X|Z) – H(X|Y,Z)

=H(X|Z)

=P(Z=1)H(X|Z=1)=1/2 bit.
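This example can be verified by direct enumeration; the sketch below computes I(X;Y) and I(X;Y|Z) from the joint distribution of (X, Y, Z = X + Y) using standard entropy identities (the identities, not the slide's derivation):

```python
import math
from itertools import product

# X, Y independent fair bits; Z = X + Y.
p_xyz = {(x, y, x + y): 0.25 for x, y in product((0, 1), repeat=2)}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(dist, keep):
    out = {}
    for key, p in dist.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0) + p
    return out

# I(X;Y) = H(X) + H(Y) - H(X,Y)
p_x, p_y, p_xy = marginal(p_xyz, [0]), marginal(p_xyz, [1]), marginal(p_xyz, [0, 1])
I_xy = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
p_xz, p_yz, p_z = marginal(p_xyz, [0, 2]), marginal(p_xyz, [1, 2]), marginal(p_xyz, [2])
I_xy_given_z = H(p_xz.values()) + H(p_yz.values()) - H(p_z.values()) - H(p_xyz.values())

print(I_xy)           # 0.0
print(I_xy_given_z)   # 0.5
```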


Information Theory 42

Fano’s inequality:

Fano’s inequality relates the probability of error in

guessing the r.v. X to its conditional entropy H(X|Y).

Note that:

The conditional entropy of a r.v. X given another

random variable Y is zero iff X is a function of Y.

proof: HW

we can estimate X from Y with zero probability of

error iff H(X|Y)=0.

we expect to be able to estimate X with a low

probability of error only if the conditional entropy

H(X|Y) is small.

Fano’s inequality quantifies this idea.

H(X|Y)=0 implies there is no uncertainty about X if we know Y

for all x with p(x)>0, there is only one possible value of y with p(x,y)>0


Information Theory 43

Suppose we wish to estimate a r.v. X with a

distribution p(x). We observe a r.v. Y which is

related to X by the conditional distribution

p(y|x). From Y, we calculate a function g(Y) = \hat{X}, which is an estimate of X. We wish to bound the probability that \hat{X} \ne X. We observe that

X → Y → \hat{X}

forms a Markov chain. Define the probability of error

P_e = Pr\{\hat{X} \ne X\} = Pr\{g(Y) \ne X\}


Information Theory 44

Theorem: (Fano's inequality)

For any estimator \hat{X} such that X → Y → \hat{X}, with P_e = Pr(\hat{X} \ne X), we have

H(P_e) + P_e \log(|\mathcal{X}| - 1) \ge H(X|Y)

This inequality can be weakened to

1 + P_e \log|\mathcal{X}| \ge H(X|Y)

or

P_e \ge \frac{H(X|Y) - 1}{\log|\mathcal{X}|}

Remark: P_e = 0 ⇒ H(X|Y) = 0

H(P_e) \le 1, since E is a binary r.v.

\log(|\mathcal{X}| - 1) \le \log|\mathcal{X}|


Information Theory 45

Proof: Define an error r.v.

E = 1 if \hat{X} \ne X,  E = 0 if \hat{X} = X.

By the chain rule for entropies, we have

H(E, X | \hat{X}) = H(X|\hat{X}) + H(E|X, \hat{X})
                  = H(E|\hat{X}) + H(X|E, \hat{X})

Since conditioning reduces entropy, H(E|\hat{X}) \le H(E). Now since E is a function of X and \hat{X}, H(E|X, \hat{X}) = 0. Since E is a binary-valued r.v., H(E) = H(P_e).

The remaining term, H(X|E, \hat{X}), can be bounded as follows:

H(X|E, \hat{X}) = Pr(E=0) H(X|\hat{X}, E=0) + Pr(E=1) H(X|\hat{X}, E=1)
                \le (1 - P_e) \cdot 0 + P_e \log(|\mathcal{X}| - 1)

Hence H(X|\hat{X}) \le H(P_e) + P_e \log(|\mathcal{X}| - 1).


Information Theory 46

Since given E = 0, X = \hat{X}, and given E = 1, we can upper bound the conditional entropy by the log of the number of remaining outcomes (|\mathcal{X}| - 1).

Thus H(P_e) + P_e \log|\mathcal{X}| \ge H(X|\hat{X}). By the data processing inequality, we have I(X; \hat{X}) \le I(X; Y) since X → Y → \hat{X}, and therefore H(X|\hat{X}) \ge H(X|Y). Thus we have H(P_e) + P_e \log|\mathcal{X}| \ge H(X|\hat{X}) \ge H(X|Y).

Remark:

Suppose there is no knowledge of Y. Thus X must be guessed without any information. Let X ∈ {1, 2, …, m} and P_1 \ge P_2 \ge \cdots \ge P_m. Then the best guess of X is \hat{X} = 1 and the resulting probability of error is P_e = 1 - P_1.

Fano's inequality becomes

H(P_e) + P_e \log(m - 1) \ge H(X)

The probability mass function

(P_1, P_2, …, P_m) = (1 - P_e, \frac{P_e}{m-1}, …, \frac{P_e}{m-1})

achieves this bound with equality.
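A numeric check that the stated pmf achieves Fano's bound with equality, assuming the illustrative values m = 4 and P_e = 0.3 (the values are not from the slides):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

m, Pe = 4, 0.3                    # illustrative values
p = [1 - Pe] + [Pe / (m - 1)] * (m - 1)

lhs = H([Pe, 1 - Pe]) + Pe * math.log2(m - 1)   # H(Pe) + Pe log(m-1)
rhs = H(p)                                      # H(X) for the extremal pmf

print(math.isclose(lhs, rhs), lhs)              # True -- the bound is tight here
```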


Information Theory 47

Some Properties of the Relative Entropy

1. Let \mu_n and \mu_n' be two probability distributions on the state space of a Markov chain at time n, and let \mu_{n+1} and \mu_{n+1}' be the corresponding distributions at time n+1. Let the corresponding joint mass function be

denoted by p and q.

That is,

p(xn, xn+1) = p(xn) r(xn+1| xn)

q(xn, xn+1) = q(xn) r(xn+1| xn)

where

r(· | ·) is the probability transition function for the

Markov chain.


Information Theory 48

Then by the chain rule for relative entropy, we have

the following two expansions:

D(p(xn, xn+1)||q(xn, xn+1))

= D(p(xn)||q(xn)) + D(p(xn+1|xn)||q(xn+1|xn))

= D(p(xn+1)||q(xn+1)) + D(p(xn|xn+1)||q(xn|xn+1))

Since both p and q are derived from the same Markov

chain, so

p(xn+1|xn) = q(xn+1|xn) = r(xn+1|xn),

and hence

D(p(xn+1|xn) || q(xn+1|xn)) = 0


Information Theory 49

That is,

D(p(xn) || q(xn))

= D(p(xn+1) || q(xn+1)) + D(p(xn|xn+1) || q(xn|xn +1))

Since D(p(xn|xn+1) || q(xn|xn+1)) \ge 0,

D(p(xn) || q(xn)) \ge D(p(xn+1) || q(xn+1))

or  D(\mu_n || \mu_n') \ge D(\mu_{n+1} || \mu_{n+1}')

Conclusion:

The distance between the probability mass

functions is decreasing with time n for any

Markov chain.


Information Theory 50

2. Relative entropy D(\mu_n || \mu) between a distribution \mu_n on the states at time n and a stationary distribution \mu decreases with n.

In the last equation, if we let \mu_n' be any stationary distribution \mu, then \mu_{n+1}' is the same stationary distribution. Hence

D(\mu_n || \mu) \ge D(\mu_{n+1} || \mu)

Any state distribution gets closer and closer to each stationary distribution as time passes:

\lim_{n \to \infty} D(\mu_n || \mu) = 0
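A small simulation of this monotonicity, using an arbitrary 2-state transition matrix and two different initial distributions (both chosen only for illustration):

```python
import math

# An arbitrary 2-state transition matrix r[i][j] = Pr(X_{n+1}=j | X_n=i) (illustrative).
r = [[0.9, 0.1],
     [0.3, 0.7]]

def step(mu):
    """One step of the chain: mu_{n+1}(j) = sum_i mu_n(i) r(i, j)."""
    return [sum(mu[i] * r[i][j] for i in range(2)) for j in range(2)]

def D(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

mu, nu = [1.0, 0.0], [0.2, 0.8]   # two different initial distributions
for n in range(6):
    print(n, round(D(mu, nu), 4))  # non-increasing in n
    mu, nu = step(mu), step(nu)
```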


Information Theory 51

3. Def: A probability transition matrix [P_{ij}], P_{ij} = Pr{x_{n+1} = j | x_n = i}, is called doubly stochastic if

\sum_i P_{ij} = 1,   j = 1, 2, …

and

\sum_j P_{ij} = 1,   i = 1, 2, …

The uniform distribution is a stationary distribution of

P iff the probability transition matrix is doubly

stochastic.


Information Theory 52

4. The conditional entropy H(Xn|X1) increases with n for a stationary Markov process.

If the Markov process is stationary, then H(Xn) is constant, so the entropy is non-increasing. However, it can be proved that H(Xn|X1) increases with n. This implies that:

the conditional uncertainty of the future increases.

Proof:

H(Xn|X1) \ge H(Xn|X1, X2)   (conditioning reduces entropy)

= H(Xn|X2)   (by Markovity)

= H(Xn-1|X1)   (by stationarity)

Similarly: H(X0|Xn) is increasing in n for any Markov chain.


Information Theory 53

Sufficient Statistics

Suppose we have a family of probability mass functions {f_\theta(x)} indexed by \theta, and let X be a sample from a distribution in this family. Let T(X) be any statistic (function of the sample) like the sample mean or sample variance. Then

\theta → X → T(X),

and by the data processing inequality, we have

I(\theta; T(X)) \le I(\theta; X)

for any distribution on \theta. However, if equality holds, no information is lost.

A statistic T(X) is called sufficient for \theta if it contains all the information in X about \theta.


Information Theory 54

Def:

A function T(X) is said to be a sufficient statistic relative to the family {f_\theta(x)} if X is independent of \theta given T(X), i.e., \theta → T(X) → X forms a Markov chain.

or:

I(\theta; X) = I(\theta; T(X))

for all distributions on \theta.

Sufficient statistics preserve mutual information.


Information Theory 55

Some examples of Sufficient Statistics

1. Let X_1, X_2, …, X_n, X_i ∈ {0, 1}, be an i.i.d. sequence of coin tosses of a coin with unknown parameter \theta = Pr(X_i = 1).

Given n, the number of 1's is a sufficient statistic for \theta. Here

T(X_1, X_2, …, X_n) = \sum_{i=1}^{n} X_i.

Given T, all sequences having that many 1's are equally likely and independent of the parameter \theta.


Information Theory 56

\Pr\left( (X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n) \,\middle|\, \sum_{i=1}^{n} X_i = k \right)
  = \begin{cases} \dfrac{1}{\binom{n}{k}} & \text{if } \sum_i x_i = k \\ 0 & \text{otherwise} \end{cases}

Thus \theta → T → (X_1, X_2, \ldots, X_n), and T is a sufficient statistic for \theta.
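A brute-force check of this sufficiency property for small n: under two different values of \theta (assumed here only for the demonstration), the conditional distribution of the sequence given T = k is the same, namely uniform over the C(n, k) sequences with k ones:

```python
from itertools import product
from math import comb

def cond_given_T(theta, n, k):
    """Pr(x_1..x_n | sum = k) for each sequence with k ones, under Bernoulli(theta)."""
    seqs = [s for s in product((0, 1), repeat=n) if sum(s) == k]
    probs = [theta ** sum(s) * (1 - theta) ** (n - sum(s)) for s in seqs]
    total = sum(probs)
    return [p / total for p in probs]

n, k = 4, 2
print(cond_given_T(0.3, n, k))   # all equal to 1/C(4,2) = 1/6
print(cond_given_T(0.8, n, k))   # same values: independent of theta
print(1 / comb(n, k))
```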


Information Theory 57

2. If X is normally distributed with mean \theta and variance 1; that is, if

f_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-\theta)^2}{2}} = N(\theta, 1)

and X_1, X_2, …, X_n are drawn independently according to f_\theta, then a sufficient statistic for \theta is the sample mean

\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.

It can be verified that the conditional distribution P(X_1, X_2, …, X_n | \bar{X}_n, n) is independent of \theta.


Information Theory 58

The minimal sufficient statistic is a sufficient statistic that is a function of all other sufficient statistics.

Def:

A statistic T(X) is a minimal sufficient statistic relative to {f_\theta(x)} if it is a function of every other sufficient statistic U:

\forall U:  \theta → T(X) → U(X) → X

Hence, a minimal sufficient statistic maximally compresses the information about \theta in the sample. Other sufficient statistics may contain additional irrelevant information.

The sufficient statistics of the above examples are minimal.


Information Theory 59

Shuffles increase Entropy:

If T is a shuffle (permutation) of a deck of cards and X is the initial (random) position of the cards in the deck and if the choice of the shuffle T is independent of X, then

H(TX) \ge H(X)

where TX is the permutation of the deck induced by the shuffle T on the initial permutation X.

Proof: H(TX) \ge H(TX|T)

= H(T^{-1}TX|T)   (why?)

= H(X|T)

= H(X)

if X and T are independent!


Information Theory 60

If X and X' are i.i.d. with entropy H(X), then Pr(X = X') \ge 2^{-H(X)}

with equality iff X has a uniform distribution.

pf: suppose X ~ p(x). By Jensen's inequality (notice that the function f(y) = 2^y is convex), we have

2^{E \log p(X)} \le E\, 2^{\log p(X)}

which implies that

2^{-H(X)} = 2^{\sum_x p(x) \log p(x)} \le \sum_x p(x)\, 2^{\log p(x)} = \sum_x p^2(x) = Pr(X = X')

( Let X and X' be two i.i.d. rv's with entropy H(X). The probability that X = X' is given by Pr(X = X') = \sum_x p^2(x). )

Let X, X' be independent with X ~ p(x), X' ~ r(x), x, x' ∈ \mathcal{X}.

Then  Pr(X = X') \ge 2^{-H(p) - D(p||r)}

      Pr(X = X') \ge 2^{-H(r) - D(r||p)}

pf: 2^{-H(p) - D(p||r)} = 2^{\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{r(x)}{p(x)}} = 2^{\sum_x p(x) \log r(x)} \le \sum_x p(x)\, 2^{\log r(x)} = \sum_x p(x) r(x) = Pr(X = X')
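A final numeric check of the collision-probability bound Pr(X = X') \ge 2^{-H(X)}, using a randomly generated pmf (illustrative only, not from the slides):

```python
import math
import random

random.seed(0)
w = [random.random() for _ in range(6)]
p = [x / sum(w) for x in w]            # a random pmf over 6 symbols (illustrative)

H = -sum(q * math.log2(q) for q in p)  # entropy in bits
collision = sum(q * q for q in p)      # Pr(X = X') for i.i.d. X, X' ~ p

print(collision >= 2 ** (-H), collision, 2 ** (-H))   # True
```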