
    Information and Entropy

    Information is a measure of uncertainty.

    Entropy:

    H(X) = -\sum_i p(x_i) \log p(x_i)

    Equivalently

    H(X) = E\left[ \log \frac{1}{p(X)} \right]

    Example: Bernoulli r.v. w.p. p, with H(p) = -p \log p - (1 - p) \log(1 - p)

    Entropy is always non-negative (why?)

    [Figure: plot of the binary entropy function H(p) versus p on [0, 1]]
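    A minimal numerical sketch of the binary entropy function plotted above (not part of the original slides; the function name binary_entropy and the use of base-2 logarithms are illustrative choices):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2(1-p), with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy is non-negative and peaks at p = 0.5 (1 bit).
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print(f"p = {p:.2f}  H(p) = {binary_entropy(p):.4f} bits")
```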


    Joint and Conditional Entropy

    H(X, Y) = -\sum_i \sum_j p(x_i, y_j) \log p(x_i, y_j)
            = E_{X,Y}\left[ \log \frac{1}{p(X, Y)} \right]

    H(Y|X) = \sum_i p_X(x_i) \, H(Y|X = x_i)
           = -\sum_i \sum_j p(x_i) \, p(y_j|x_i) \log p(y_j|x_i)
           = -\sum_i \sum_j p(x_i, y_j) \log p(y_j|x_i)
           = E_{X,Y}\left[ \log \frac{1}{p(Y|X)} \right]
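    A sketch of these definitions on a small joint pmf (the example table and helper names are assumptions for illustration, not from the slides):

```python
import math

# Hypothetical joint pmf p(x, y) over X = {0, 1}, Y = {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H_joint(pxy):
    """H(X, Y) = -sum p(x, y) log2 p(x, y)."""
    return -sum(p * math.log2(p) for p in pxy.values() if p > 0)

def H_cond(pxy):
    """H(Y|X) = -sum p(x, y) log2 p(y|x)."""
    px = {}
    for (x, _), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
    return -sum(p * math.log2(p / px[x]) for (x, _), p in pxy.items() if p > 0)

print("H(X,Y) =", H_joint(p_xy))   # joint entropy
print("H(Y|X) =", H_cond(p_xy))    # conditional entropy
```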



    Chain Rule

    This is one of the most useful identities in information theory:

    H(X, Y) = H(X) + H(Y|X)

    Can you think of an intuitive explanation for it?

    Chain rule for conditional entropies:

    H(X, Y|Z) = H(X|Z) + H(Y|X, Z)

    Chain rule applied multiple times:

    H(X_1, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_{n-1}, \ldots, X_1)
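    A quick numerical check of the chain rule H(X, Y) = H(X) + H(Y|X) on an arbitrary joint pmf (the specific table is an assumption chosen for illustration):

```python
import math

p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}  # hypothetical joint pmf

def H(dist):
    """Entropy (bits) of a pmf given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal p(x), then H(Y|X) computed directly from its definition.
p_x = {}
for (x, _), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

H_XY = H(p_xy)
H_X = H(p_x)
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(H_XY, H_X + H_Y_given_X)  # the two sides of the chain rule agree
```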


    Information Divergence

    Kullback-Leibler distance or information divergence

    D(p||q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}

    Notes:

    D(p||q) is not symmetric

    D(p||p) = 0

    Represents a notion of distance between two distributions (though it is not a true metric)

    Can characterize error probability in detection
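    A short sketch computing D(p||q) for two Bernoulli distributions, illustrating the notes above (the distributions chosen are arbitrary examples):

```python
import math

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # Bernoulli(0.5)
q = [0.9, 0.1]   # Bernoulli(0.9)

print(kl(p, q), kl(q, p))  # not symmetric
print(kl(p, p))            # D(p||p) = 0
```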



    D(p||q) for Bernoulli R.V.

    [Figure: surface plot of D(p||q) for Bernoulli distributions, as a function of p and q]


    Mutual Information

    Mutual information: the information of one r.v. about another

    I(X;Y) = D\big( p(x, y) \,||\, p(x)p(y) \big)
           = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}
           = E_{X,Y}\left[ \log \frac{p(X, Y)}{p(X)p(Y)} \right]

    I(X;Y) = H(X) - H(X|Y) (why?)

    Similarly, we define conditional mutual information:

    I(X;Y|Z) = H(X|Z) - H(X|Y, Z)

    Chain Rule:

    I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i | Y_{i-1}, \ldots, Y_1)
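    A sketch computing I(X;Y) both as D(p(x,y)||p(x)p(y)) and as H(X) - H(X|Y), illustrating the "(why?)" above (the joint pmf is an arbitrary example):

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # hypothetical joint pmf

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X;Y) as the KL divergence between the joint and the product of marginals.
I_kl = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# I(X;Y) as H(X) - H(X|Y).
H_X = -sum(p * math.log2(p) for p in p_x.values())
H_X_given_Y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

print(I_kl, H_X - H_X_given_Y)  # both forms give the same value
```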



    Entropy Relationships

    [Figure: Venn diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y)]

    H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

    I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

    H(X, Y) = H(X) + H(Y) - I(X;Y)


    Jensen's Inequality

    [Figure: example functions that are convex, concave, and neither]

    f() is convex if for any 0 \leq \lambda \leq 1,

    f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)

    Jensen's Inequality: If a function f() is convex, then

    E[f(X)] \geq f(E[X])

    If f() is strictly convex, equality is achieved if and only if X is trivial (i.e., constant with probability 1).

    Proof: Use induction, definition of convexity, & continuity arguments.
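    A small numerical illustration of Jensen's inequality for the convex function f(x) = x^2 (the random variable and its pmf are arbitrary assumptions):

```python
# Discrete r.v. X with an arbitrary pmf, and the convex function f(x) = x**2.
values = [1.0, 2.0, 5.0]
probs  = [0.5, 0.3, 0.2]

E_X  = sum(p * x for p, x in zip(probs, values))
E_fX = sum(p * x**2 for p, x in zip(probs, values))

print(E_fX, E_X**2)      # E[f(X)] >= f(E[X])
assert E_fX >= E_X**2
```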



    Properties of KL and Mutual Information

    D(p||q) \geq 0

    Proof:

    D(p||q) = -\sum_i p_i \log \frac{q_i}{p_i} \geq -\log \sum_i p_i \frac{q_i}{p_i} = -\log \sum_i q_i = 0,

    where the inequality is Jensen's inequality applied to the concave function \log.

    I(X;Y) \geq 0

    Proof:

    I(X;Y) = D\big( p(x, y) \,||\, p(x)p(y) \big) \geq 0

    I(X;Y|Z) \geq 0

    Proof: I(X;Y|Z) = D\big( p(x, y|z) \,||\, p(x|z)p(y|z) \big) \geq 0 (averaged over Z)


    Some Inequalities

    H(X_1, \ldots, X_n) \leq \sum_i H(X_i)    (Independence bound)

    Proof: Use the chain rule together with "conditioning reduces entropy" (below)

    H(X) \leq \log |\mathcal{X}|    (Uniform distribution maximizes entropy)

    Proof: D(p_X || u) = \log |\mathcal{X}| - H(X) \geq 0, where u is the uniform distribution on \mathcal{X}

    H(X|Y) \leq H(X)    (Conditioning reduces entropy)

    Proof: H(X) - H(X|Y) = I(X;Y) \geq 0
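    A sketch checking these three bounds on a randomly generated joint pmf (the alphabet sizes and the use of random.random are arbitrary choices for illustration):

```python
import math, random

# Random joint pmf over a 4 x 3 alphabet.
raw = [[random.random() for _ in range(3)] for _ in range(4)]
total = sum(sum(row) for row in raw)
p_xy = [[v / total for v in row] for row in raw]

p_x = [sum(row) for row in p_xy]
p_y = [sum(p_xy[i][j] for i in range(4)) for j in range(3)]

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_X, H_Y = H(p_x), H(p_y)
H_XY = H([p_xy[i][j] for i in range(4) for j in range(3)])
H_X_given_Y = H_XY - H_Y  # chain rule

assert H_XY <= H_X + H_Y + 1e-9        # independence bound
assert H_X <= math.log2(4) + 1e-9      # uniform distribution maximizes entropy
assert H_X_given_Y <= H_X + 1e-9       # conditioning reduces entropy
print("all three inequalities hold")
```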



    Convexity/Concavity of Information Functions

    D(p||q) is convex in the pair (p, q)

    Proof: Uses the log-sum inequality

    H(X) is a concave function of the distribution p(x)

    I(X;Y) is a convex function of p(y|x) for fixed p(x), and a concave function of p(x) for fixed p(y|x).

    Proof: Uses the convexity of D(p||q).
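    A numerical spot-check of the convexity of D(p||q) in the pair (p, q): for randomly drawn pairs of Bernoulli distributions and a random mixing weight, the divergence of the mixtures never exceeds the mixture of the divergences (all sampled values here are illustrative assumptions):

```python
import math, random

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bernoulli(t):
    return [t, 1.0 - t]

for _ in range(1000):
    p1, p2 = bernoulli(random.uniform(0.01, 0.99)), bernoulli(random.uniform(0.01, 0.99))
    q1, q2 = bernoulli(random.uniform(0.01, 0.99)), bernoulli(random.uniform(0.01, 0.99))
    lam = random.random()
    p_mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
    q_mix = [lam * a + (1 - lam) * b for a, b in zip(q1, q2)]
    # D(lam*p1+(1-lam)*p2 || lam*q1+(1-lam)*q2) <= lam*D(p1||q1) + (1-lam)*D(p2||q2)
    assert kl(p_mix, q_mix) <= lam * kl(p1, q1) + (1 - lam) * kl(p2, q2) + 1e-9
print("convexity held on all sampled pairs")
```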


    Data Processing Inequality

    X, Y, Z form a Markov chain (written X \to Y \to Z) if

    p(x, y, z) = p(y) p(x|y) p(z|y)

    Then

    I(X;Y) \geq I(X;Z)

    Proof:

    I(X; Y, Z) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y)

    Since X \to Y \to Z, the last term I(X;Z|Y) = 0, so I(X;Y) = I(X;Z) + I(X;Y|Z) \geq I(X;Z).

    In particular, X \to Y \to g(Y) is a Markov chain (why?), so

    I(X;Y) \geq I(X; g(Y)).

    Processing Y cannot increase the information it provides about X.

    Question: Then why should we ever do signal processing?

    Corollary: If X \to Y \to Z, then I(X;Y|Z) \leq I(X;Y)
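    A sketch that builds a Markov chain X \to Y \to Z from arbitrary transition probabilities and checks I(X;Y) >= I(X;Z) numerically (the input distribution and the two channels are assumptions chosen for illustration):

```python
import math

p_x = [0.3, 0.7]                        # hypothetical input distribution
p_y_given_x = [[0.8, 0.2], [0.1, 0.9]]  # channel X -> Y
p_z_given_y = [[0.6, 0.4], [0.3, 0.7]]  # channel Y -> Z

def mutual_info(p_ab):
    """I(A;B) from a joint pmf given as a nested list p_ab[a][b]."""
    p_a = [sum(row) for row in p_ab]
    p_b = [sum(p_ab[a][b] for a in range(len(p_ab))) for b in range(len(p_ab[0]))]
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for a, row in enumerate(p_ab) for b, p in enumerate(row) if p > 0)

# Joint pmfs p(x, y) and p(x, z) induced by the Markov chain X -> Y -> Z.
p_xy = [[p_x[x] * p_y_given_x[x][y] for y in range(2)] for x in range(2)]
p_xz = [[sum(p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z] for y in range(2))
         for z in range(2)] for x in range(2)]

print(mutual_info(p_xy), mutual_info(p_xz))  # I(X;Y) >= I(X;Z)
```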



    Fano's Inequality

    Want to estimate X from Y; call the estimate \hat{X} and let P_e = \mathrm{Prob}(\hat{X} \neq X).

    X \to Y \to \hat{X}

    P_e \log |\mathcal{X}| \geq H(X|Y) - H(P_e)

    Sometimes simplified to:

    P_e \geq \frac{H(X|Y) - 1}{\log |\mathcal{X}|}

    Q: Why is this useful?

    A: It shows that there are limits to our ability to communicate or estimate well, and those limits are governed by H(X|Y).
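    A numerical sanity check of the bound on a simple symmetric channel with the estimator \hat{X} = Y (the alphabet size, crossover probability, and estimator are arbitrary choices, not from the slides):

```python
import math

M = 4       # alphabet size |X|
eps = 0.2   # probability the channel corrupts the symbol (arbitrary choice)

# X uniform on {0,...,M-1}; given X = x, Y = x w.p. 1-eps, else uniform over the others.
p_xy = {}
for x in range(M):
    for y in range(M):
        p_y_given_x = (1 - eps) if y == x else eps / (M - 1)
        p_xy[(x, y)] = (1.0 / M) * p_y_given_x

p_y = {}
for (x, y), p in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + p

H_X_given_Y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

# With the estimator Xhat = Y, the error probability is simply eps.
Pe = eps
H_Pe = -Pe * math.log2(Pe) - (1 - Pe) * math.log2(1 - Pe)

print(Pe * math.log2(M), ">=", H_X_given_Y - H_Pe)  # Fano's inequality
```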


    Fano's Inequality (proof)

    Consider a Bernoulli R.V. indicating the error, E = \mathbf{1}\{\hat{X} \neq X\}.

    H(E, X | \hat{X}) = H(X | \hat{X}) + H(E | X, \hat{X})
                      = H(E | \hat{X}) + H(X | E, \hat{X})

    where

    H(E | X, \hat{X}) = 0 (E is a function of X and \hat{X}),
    H(X | \hat{X}) \geq H(X|Y) (data processing, since X \to Y \to \hat{X}),
    H(E | \hat{X}) \leq H(E) = H(P_e),
    H(X | E, \hat{X}) \leq P_e \log |\mathcal{X}|.

    Combining these gives H(X|Y) \leq H(P_e) + P_e \log |\mathcal{X}|.
