
Entropy, Relative Entropy, and Mutual Information

Some basic notions of Information Theory

Radu Trîmbițaș

October 2012

Outline

Contents

1 Entropy and its Properties
1.1 Entropy
1.2 Joint Entropy and Conditional Entropy
1.3 Relative Entropy and Mutual Information
1.4 Relationship between Entropy and Mutual Information
1.5 Chain Rules for Entropy, Relative Entropy and Mutual Information

2 Inequalities in Information Theory
2.1 Jensen inequality and its consequences
2.2 Log sum inequality and its applications
2.3 Data-processing inequality
2.4 Sufficient statistics
2.5 Fano's inequality

1 Entropy and its Properties

1.1 Entropy

Entropy of a discrete RV

• a measure of uncertainty of a random variable

• X a discrete random variable with alphabet X, distribution X ∼ (x_i, p_i)_{i∈I}, and probability mass function p(x) = P(X = x)


Figure 1: Graph of H(p)

Definition 1. The entropy of the discrete random variable X is

H(X) = − ∑_{x∈X} p(x) log p(x)   (1)

or, equivalently,

H(X) = E_p ( log 1/p(X) ).   (2)

• measured in bits!

• logarithm in base 2; H_b(X) denotes the entropy computed in base b; for b = e the entropy is measured in nats

• convention 0 log 0 = 0, since lim_{x↘0} x log x = 0

Entropy - Properties

Lemma 2. H(X) ≥ 0

Lemma 3. H_b(X) = (log_b a) H_a(X)

Example 4. Let X be the RV with P(X = 1) = p and P(X = 0) = 1 − p. Then

H(X) = −p log p − (1 − p) log(1 − p) =: H(p)   (3)

H(X) = 1 bit when p = 1/2. The graph of H(p) is shown in Figure 1.
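As a quick numerical check (a sketch added here, not part of the original notes), the binary entropy (3) can be evaluated in Python with base-2 logarithms:

    import math

    def binary_entropy(p: float) -> float:
        # H(p) = -p log2(p) - (1-p) log2(1-p), with the convention 0 log 0 = 0
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    print(binary_entropy(0.5))   # 1.0 bit, the maximum of H(p)
    print(binary_entropy(0.11))  # approximately 0.5 bit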

Entropy - Properties II


Entropy - Properties III

Let X take the values a, b, c, d with probabilities 1/2, 1/4, 1/8, 1/8. The entropy of X is

H(X) = −(1/2) log_2(1/2) − (1/4) log_2(1/4) − (1/8) log_2(1/8) − (1/8) log_2(1/8) = 7/4 bits

Problem: Determine the value of X with the minimum number of binary questions.

Solution: Ask "Is X = a?", then "Is X = b?", then "Is X = c?". The resulting expected number of questions is 7/4 = 1.75. See the lectures on data compression: the minimum expected number of binary questions required to determine X lies between H(X) and H(X) + 1.
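A short Python sketch (not from the notes) that checks both the entropy and the expected number of questions for this particular question scheme:

    from math import log2

    def entropy(pmf):
        # H(X) = -sum p(x) log2 p(x); zero-probability terms contribute 0
        return -sum(p * log2(p) for p in pmf if p > 0)

    pmf = [1/2, 1/4, 1/8, 1/8]          # P(X = a), P(X = b), P(X = c), P(X = d)
    print(entropy(pmf))                 # 1.75 bits

    # "Is X = a?", "Is X = b?", "Is X = c?" needs 1, 2, 3, 3 questions
    # for a, b, c, d respectively
    questions = [1, 2, 3, 3]
    print(sum(p * q for p, q in zip(pmf, questions)))   # 1.75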

1.2 Joint Entropy and Conditional Entropy

Joint Entropy and Conditional Entropy

• (X, Y) a pair of discrete RVs over the alphabets X, Y,

X ∼ (x_i, p_i)_{i∈I},  Y ∼ (y_j, q_j)_{j∈J}

• joint distribution of X and Y

p(x, y) = P(X = x, Y = y), x ∈ X , y ∈ Y

• (marginal) distribution of X

p_X(x) = p(x) = P(X = x) = ∑_{y∈Y} p(x, y)

• (marginal) distribution of Y

p_Y(y) = p(y) = P(Y = y) = ∑_{x∈X} p(x, y)

Definition 5. The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) ∼ p(x, y) is

H(X, Y) = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)   (4)

also expressed as H(X, Y) = −E (log p(X, Y))


Definition 6. For (X, Y) ∼ p(x, y), the conditional entropy H(Y|X) is

H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)   (5)

where

H(Y|X = x) := − ∑_{y∈Y} p(y|x) log p(y|x)

and p(y|x) := P(Y = y|X = x) = P(Y = y, X = x)/P(X = x) = p(x, y)/p(x) is the conditional probability.

• By computation,

H(Y|X) = − ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x)   (6)
       = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)   (7)
       = − E(log p(Y|X))   (8)

• naturalness of the last two definitions: the entropy of a pair of RVs is the entropy of one plus the conditional entropy of the other – see the next theorem

Theorem 7 (Chain Rule).

H(X, Y) = H(X) + H(Y|X). (9)

Proof.

H(X, Y) = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)
        = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x) p(y|x)]
        = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
        = − ∑_{x∈X} p(x) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
        = H(X) + H(Y|X).

Equivalently (shorter proof): we can write

log p(X, Y) = log p(X) + log p(Y|X)

and apply E to both sides.


Joint Entropy and Conditional Entropy - b

Corollary 8. H(X, Y|Z) = H(X|Z) + H(Y|Z, X).

Example 9. Let (X, Y) have the joint distribution p(x, y):

Y\X     1      2      3      4
 1     1/8    1/16   1/32   1/32
 2     1/16   1/8    1/32   1/32
 3     1/16   1/16   1/16   1/16
 4     1/4    0      0      0

• marginal distributions: X ∼ (1/2, 1/4, 1/8, 1/8), Y ∼ (1/4, 1/4, 1/4, 1/4)

• H(X) = 7/4 bits, H(Y) = 2 bits

• conditional entropy

H(X|Y) = ∑_{i=1}^{4} p(Y = i) H(X|Y = i)
       = (1/4) H(1/2, 1/4, 1/8, 1/8) + (1/4) H(1/4, 1/2, 1/8, 1/8) + (1/4) H(1/4, 1/4, 1/4, 1/4) + (1/4) H(1, 0, 0, 0)
       = (1/4) (7/4 + 7/4 + 2 + 0) = 11/8 bits

• similarly, H(Y|X) = 13/8 bits and H(X, Y) = 27/8 bits

• Remark. If H(X) ≠ H(Y), then H(Y|X) ≠ H(X|Y). However, H(X) − H(X|Y) = H(Y) − H(Y|X).
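The numbers above can be verified with a small Python sketch (added here for illustration; it assumes NumPy is available):

    import numpy as np

    # joint pmf p(x, y) of Example 9; rows are y = 1..4, columns are x = 1..4
    P = np.array([[1/8,  1/16, 1/32, 1/32],
                  [1/16, 1/8,  1/32, 1/32],
                  [1/16, 1/16, 1/16, 1/16],
                  [1/4,  0,    0,    0  ]])

    def H(p):
        # entropy in bits of a pmf given as an array; zeros contribute 0
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    px = P.sum(axis=0)                 # marginal of X: (1/2, 1/4, 1/8, 1/8)
    py = P.sum(axis=1)                 # marginal of Y: (1/4, 1/4, 1/4, 1/4)

    print(H(px), H(py), H(P.ravel()))  # 1.75, 2.0, 3.375 (= 27/8)
    print(H(P.ravel()) - H(py))        # H(X|Y) = 1.375 (= 11/8), by the chain rule (9)
    print(H(P.ravel()) - H(px))        # H(Y|X) = 1.625 (= 13/8)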

1.3 Relative Entropy and Mutual Information

Relative Entropy and Mutual Information

Definition 10. The relative entropy or Kullback-Leibler distance between p(x) and q(x) is

D(p ‖ q) = ∑_{x∈X} p(x) log [p(x)/q(x)] = E_p ( log [p(X)/q(X)] ).

• Conventions: 0 log (0/0) = 0, 0 log (0/q) = 0, p log (p/0) = ∞

• It is not a true distance, since it is not symmetric and does not satisfy thetriangle inequality – sometimes called Kullback-Leibler divergence.


Definition 11. Let (X, Y) ∼ p(x, y), with marginal mass functions p(x) and p(y); the mutual information I(X; Y) is the relative entropy between p(x, y) and p(x)p(y):

I(X; Y) = D(p(x, y) ‖ p(x)p(y))   (10)
        = ∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x, y)/(p(x)p(y))]   (11)
        = E_{p(x,y)} ( log [p(X, Y)/(p(X)p(Y))] ).   (12)

Remark. In general D(p ‖ q) ≠ D(q ‖ p), as the next example shows.

Interpretation. I(X; Y) measures the average reduction in uncertainty of X that results from knowing Y.

Example 12. X = {0, 1}, p(0) = 1− r, p(1) = r, q(0) = 1− s, q(1) = s.

D(p ‖ q) = (1 − r) log [(1 − r)/(1 − s)] + r log (r/s)
D(q ‖ p) = (1 − s) log [(1 − s)/(1 − r)] + s log (s/r)

If r = s, then D(p ‖ q) = D(q ‖ p), but for r = 1/2, s = 1/4

D(p ‖ q) = (1/2) log [(1/2)/(3/4)] + (1/2) log [(1/2)/(1/4)] = 0.20752 bit
D(q ‖ p) = (3/4) log [(3/4)/(1/2)] + (1/4) log [(1/4)/(1/2)] = 0.18872 bit

Example - relative entropy. D(p ‖ q) = (1 − r) log [(1 − r)/(1 − s)] + r log (r/s); see Figure 2.
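A small Python sketch of this asymmetry (illustrative, not part of the notes):

    from math import log2

    def kl_bernoulli(a: float, b: float) -> float:
        # D(p || q) in bits between Bernoulli(a) and Bernoulli(b)
        def term(p, q):
            return 0.0 if p == 0 else p * log2(p / q)
        return term(a, b) + term(1 - a, 1 - b)

    r, s = 1/2, 1/4
    print(kl_bernoulli(r, s))   # D(p || q) ~ 0.20752 bit
    print(kl_bernoulli(s, r))   # D(q || p) ~ 0.18872 bit -- not symmetric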

1.4 Relationship between Entropy and Mutual Information

Relationship between Entropy and Mutual Information

Theorem 13 (Mutual information and entropy).

I(X; Y) = H(X) − H(X|Y)   (13)
I(X; Y) = H(Y) − H(Y|X)   (14)
I(X; Y) = H(X) + H(Y) − H(X, Y)   (15)
I(X; Y) = I(Y; X)   (16)
I(X; X) = H(X)   (17)


Figure 2: Relative entropy (Kullback-Leibler distance) of two Bernoulli RVs

Proof. For (13):

I(X; Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x, y)/(p(x)p(y))] = ∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x|y)/p(x)]
        = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x) + ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x|y)
        = − ∑_{x∈X} p(x) log p(x) − ( − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x|y) )
        = H(X) − H(X|Y),

using ∑_{y∈Y} p(x, y) = p(x). (14) follows by symmetry. (15) results from (13) and H(X, Y) = H(Y) + H(X|Y); (15) =⇒ (16). Finally, I(X; X) = H(X) − H(X|X) = H(X).

Relationship between entropy and mutual information

Example 14. For the joint distribution of Example 9 the mutual information is

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = 0.375 bit

The relationship between H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), and I(X; Y) is depicted in Figure 4. Notice that I(X; Y) corresponds to the intersection of the information in X with the information in Y.
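As a cross-check (a sketch, not in the original notes; NumPy assumed), I(X; Y) for Example 9 can be computed both from definition (11) and from identity (15):

    import numpy as np

    P = np.array([[1/8,  1/16, 1/32, 1/32],      # p(x, y), rows y, columns x
                  [1/16, 1/8,  1/32, 1/32],
                  [1/16, 1/16, 1/16, 1/16],
                  [1/4,  0,    0,    0  ]])

    def H(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    px, py = P.sum(axis=0), P.sum(axis=1)

    # (11): I(X;Y) = sum p(x,y) log [p(x,y) / (p(x)p(y))], zero cells skipped
    prod = np.outer(py, px)
    mask = P > 0
    I_def = float((P[mask] * np.log2(P[mask] / prod[mask])).sum())

    # (15): I(X;Y) = H(X) + H(Y) - H(X,Y)
    I_ent = H(px) + H(py) - H(P.ravel())

    print(I_def, I_ent)    # both 0.375 bit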

Relationship between entropy and mutual information


Figure 3: Graphical representation of the relation between entropy and mutualinformation

Relationship between entropy and mutual information - graphical

1.5 Chain Rules for Entropy, Relative Entropy and Mutual Information

Chain rules for entropy, relative entropy and mutual information

Theorem 15 (Chain rule for entropy). Let X_1, X_2, . . . , X_n ∼ p(x_1, x_2, . . . , x_n). Then

H(X_1, X_2, . . . , X_n) = ∑_{i=1}^{n} H(X_i | X_{i−1}, . . . , X_1).

Proof. Apply repeatedly the two-variable expansion rule for entropy:

H(X_1, X_2) = H(X_1) + H(X_2|X_1),
H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3|X_1)
                 = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1),
...
H(X_1, X_2, . . . , X_n) = H(X_1) + H(X_2|X_1) + · · · + H(X_n|X_{n−1}, . . . , X_1)
                         = ∑_{i=1}^{n} H(X_i|X_{i−1}, . . . , X_1).
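The chain rule can be checked numerically on a random joint pmf (a sketch for illustration; the shape (2, 3, 4) is arbitrary and NumPy is assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.random((2, 3, 4))
    p /= p.sum()                      # random joint pmf p(x1, x2, x3), all entries > 0

    def H(q):
        q = q[q > 0]
        return float(-(q * np.log2(q)).sum())

    p1 = p.sum(axis=(1, 2))           # p(x1)
    p12 = p.sum(axis=2)               # p(x1, x2)

    # conditional entropies from their definitions:
    # H(X2|X1) = -sum p(x1,x2) log p(x2|x1), H(X3|X2,X1) = -sum p log p(x3|x1,x2)
    H2_given_1 = float(-(p12 * np.log2(p12 / p1[:, None])).sum())
    H3_given_12 = float(-(p * np.log2(p / p12[:, :, None])).sum())

    print(H(p.ravel()))                        # H(X1, X2, X3)
    print(H(p1) + H2_given_1 + H3_given_12)    # equal, by the chain rule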


Figure 4: Venn diagram for the relationship between entropy and mutual information

Definition 16. The conditional mutual information of random variables X and Y given Z is defined by

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)   (18)
          = E_{p(x,y,z)} log [p(X, Y|Z)/(p(X|Z) p(Y|Z))]   (19)

Theorem 17 (Chain rule for information).

I(X_1, X_2, . . . , X_n; Y) = ∑_{i=1}^{n} I(X_i; Y|X_{i−1}, X_{i−2}, . . . , X_1).   (20)

Proof.

I(X_1, X_2, . . . , X_n; Y) = H(X_1, X_2, . . . , X_n) − H(X_1, X_2, . . . , X_n|Y)   (21)
  = ∑_{i=1}^{n} H(X_i|X_{i−1}, . . . , X_1) − ∑_{i=1}^{n} H(X_i|X_{i−1}, . . . , X_1, Y)
  = ∑_{i=1}^{n} I(X_i; Y|X_1, X_2, . . . , X_{i−1})   (22)

Definition 18. For joint probability mass functions p(x, y) and q(x, y), the conditional relative entropy is

D(p(y|x) ‖ q(y|x)) = ∑_x p(x) ∑_y p(y|x) log [p(y|x)/q(y|x)]   (23)
                   = E_{p(x,y)} log [p(Y|X)/q(Y|X)].   (24)

The notation is not explicit since it omits mention of the distribution p(x) of the conditioning RV. However, it is normally understood from the context.

Theorem 19 (Chain rule for relative entropy).

D (p(x, y) ‖ q(x, y)) = D(p(x) ‖ q(x)) + D (p(y|x) ‖ q(y|x)) (25)

Proof.

D(p(x, y) ‖ q(x, y)) = ∑_x ∑_y p(x, y) log [p(x, y)/q(x, y)]
  = ∑_x ∑_y p(x, y) log [p(x)p(y|x)/(q(x)q(y|x))]
  = ∑_x ∑_y p(x, y) log [p(x)/q(x)] + ∑_x ∑_y p(x, y) log [p(y|x)/q(y|x)]
  = D(p(x) ‖ q(x)) + D(p(y|x) ‖ q(y|x)).
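Equation (25) can also be verified numerically (a sketch, assuming NumPy is available; the distributions are random and have full support):

    import numpy as np

    rng = np.random.default_rng(1)
    p = rng.random((3, 4)); p /= p.sum()      # p(x, y)
    q = rng.random((3, 4)); q /= q.sum()      # q(x, y)

    def D(a, b):
        # relative entropy in bits between two full-support pmfs
        return float((a * np.log2(a / b)).sum())

    px, qx = p.sum(axis=1), q.sum(axis=1)

    # conditional relative entropy (23): sum_x p(x) sum_y p(y|x) log [p(y|x)/q(y|x)]
    p_y_x = p / px[:, None]
    q_y_x = q / qx[:, None]
    D_cond = float((p * np.log2(p_y_x / q_y_x)).sum())

    print(D(p.ravel(), q.ravel()))    # D(p(x,y) || q(x,y))
    print(D(px, qx) + D_cond)         # equal, by the chain rule (25)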

2 Inequalities in Information Theory

2.1 Jensen inequality and its consequences

Jensen inequality

Convexity underlies many of the basic properties of information-theoretic quantities such as entropy and mutual information.

Definitions 20. 1. A function f(x) is convex (∪) over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1

f(λ x_1 + (1 − λ) x_2) ≤ λ f(x_1) + (1 − λ) f(x_2).   (26)

2. f is strictly convex if equality in (26) holds only for λ = 0 or λ = 1.

3. f is concave (∩) if −f is convex.


• A function is convex if it always lies below any chord. A function is concave if it always lies above any chord.

• Examples of convex functions: x², |x|, e^x, x log x for x ≥ 0.

• Examples of concave functions: log x and √x for x ≥ 0.

• If f′′ is nonnegative (positive), then f is convex (strictly convex).

Theorem 21 (Jensen's inequality). If f is a convex function and X is a RV, then

E(f(X)) ≥ f(E(X)).   (27)

If f is strictly convex, equality in (27) implies X = E(X) with probability 1 (i.e., X is a constant).

Proof. For a discrete RV, by induction on the number of mass points. For a two-mass-point distribution, the inequality is the definition of convexity:

f(p_1 x_1 + p_2 x_2) ≤ p_1 f(x_1) + p_2 f(x_2).

Suppose the statement is true for k − 1 mass points; setting p′_i = p_i/(1 − p_k),

f( ∑_{i=1}^{k} p_i x_i ) = f( p_k x_k + (1 − p_k) ∑_{i=1}^{k−1} p′_i x_i )
  ≤ p_k f(x_k) + (1 − p_k) f( ∑_{i=1}^{k−1} p′_i x_i )
  ≤ p_k f(x_k) + (1 − p_k) ∑_{i=1}^{k−1} p′_i f(x_i) = ∑_{i=1}^{k} p_i f(x_i).

Extension to continuous distributions using continuity arguments.
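A tiny numerical illustration of (27) (a sketch, not part of the notes; the mass points and probabilities are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.integers(-5, 6, size=7).astype(float)   # mass points of a discrete RV X
    p = rng.random(7); p /= p.sum()                 # their probabilities

    def f(t):
        return t ** 2                               # a convex function

    E_fX = float((p * f(x)).sum())
    f_EX = float(f((p * x).sum()))
    print(E_fX >= f_EX)                             # True: E f(X) >= f(E X)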

Interpretation of convexity

Consequences of Jensen Inequality

• We will use Jensen to prove properties of entropy and relative entropy.

Theorem 22 (Information inequality, Gibbs' inequality). Let p(x), q(x), x ∈ X, be two probability mass functions. Then

D (p ‖ q) ≥ 0 (28)

with equality iff p(x) = q(x), ∀x ∈ X .


Proof. Let A := {x : p(x) > 0}. Then

D(p ‖ q) = ∑_{x∈A} p(x) log [p(x)/q(x)] = ∑_{x∈A} p(x) ( − log [q(x)/p(x)] )
  ≥ − log ( ∑_{x∈A} p(x) q(x)/p(x) )   (Jensen; − log is strictly convex)
  = − log ∑_{x∈A} q(x) ≥ − log ∑_{x∈X} q(x) = − log 1 = 0.

Equality holds iff q(x)/p(x) = c for all x ∈ A and ∑_{x∈A} q(x) = 1. Then 1 = ∑_{x∈X} q(x) = c ∑_{x∈X} p(x) = c, so p(x) = q(x), ∀x ∈ X.

Since I(X; Y) = D(p(x, y) ‖ p(x)p(y)) ≥ 0, with equality iff p(x, y) = p(x)p(y) (i.e., X and Y are independent), we obtain

Corollary 23. I(X; Y) ≥ 0,   (29)

with equality iff X and Y are independent.

Corollary 24. I(X; Y|Z) ≥ 0,   (30)

with equality iff X and Y are conditionally independent given Z.

Any random variable over X has an entropy no greater than log |X |.


Theorem 25. H(X) ≤ log |X |, with equality iff X ∼ U(X ).

Proof. Let p(x) be the pmf of X and u(x) = 1/|X| the pmf of the uniform distribution over X. Then

0 ≤ D(p ‖ u) = ∑_{x∈X} p(x) log [p(x)/u(x)] = log |X| − H(X).

The next theorem states that conditioning reduces entropy (or: information cannot hurt).

Theorem 26. H(X|Y) ≤ H(X)

with equality iff X and Y are independent.

Proof. 0 ≤ I(X; Y) = H(X) − H(X|Y).

Corollary 27 (Independence bound on entropy).

H(X_1, X_2, . . . , X_n) ≤ ∑_{i=1}^{n} H(X_i)

Proof. By the chain rule for entropy (Theorem 15),

H(X_1, X_2, . . . , X_n) = ∑_{i=1}^{n} H(X_i|X_{i−1}, . . . , X_1) ≤ ∑_{i=1}^{n} H(X_i),

where the inequality follows from Theorem 26.

2.2 Log sum inequality and its applications

Log sum inequality and its applications

Theorem 28 (Log sum inequality). For nonnegative numbers a_1, . . . , a_n and b_1, . . . , b_n,

∑_{i=1}^{n} a_i log (a_i/b_i) ≥ ( ∑_{i=1}^{n} a_i ) log ( ∑_{i=1}^{n} a_i / ∑_{i=1}^{n} b_i )   (31)

with equality iff a_i/b_i = const.

Conventions: 0 log 0 = 0, a log (a/0) = ∞ if a > 0, and 0 log (0/0) = 0 (by continuity).


Proof. Assume w.l.o.g. a_i > 0, b_i > 0. Since f(t) = t log t is convex for t > 0, Jensen's inequality gives

∑ α_i f(t_i) ≥ f( ∑ α_i t_i ),   α_i ≥ 0, ∑ α_i = 1.

Setting α_i = b_i / ∑_j b_j and t_i = a_i/b_i, we obtain

∑_i (a_i / ∑_j b_j) log (a_i/b_i) ≥ ( ∑_i a_i / ∑_j b_j ) log ( ∑_i a_i / ∑_j b_j ),

which is the desired inequality after multiplying both sides by ∑_j b_j.
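A quick numerical check of (31) on arbitrary positive numbers (a sketch, not in the notes; NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(3)
    a = rng.random(5)
    b = rng.random(5)

    lhs = float((a * np.log2(a / b)).sum())
    rhs = float(a.sum() * np.log2(a.sum() / b.sum()))
    print(lhs >= rhs)      # True; equality only when a_i / b_i is constant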

Homework. Prove Theorem 22 using the log sum inequality.

Using the log sum inequality it is easy to prove convexity and concavity results for relative entropy, entropy, and mutual information. See [1, Section 2.7].

2.3 Data-processing inequality

Data-processing inequality

Definition 29. Random variables X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y, and Z form a Markov chain X → Y → Z if the joint probability mass function can be written as

p(x, y, z) = p(x)p(y|x)p(z|y). (32)

Consequences:

• X → Y → Z iff X and Z are conditionally independent given Y (i.e., p(x, z|y) = p(x|y)p(z|y)). Markovity implies conditional independence because

p(x, z|y) = p(x, y, z)/p(y) = p(x, y)p(z|y)/p(y) = p(x|y)p(z|y).   (33)

• X → Y → Z =⇒ Z → Y → X, sometimes written as X ←→ Y ←→ Z (reversibility)

• If Z = f (Y), then X → Y → Z

We will prove that no processing of Y, deterministic or random, can increase the information that Y contains about X.

Theorem 30 (Data-processing inequality). If X → Y → Z, then I(X; Y) ≥ I(X; Z).


Proof. By the chain rule (20), we expand mutual information in two different ways:

I(X; Y, Z) = I(X; Z) + I(X; Y|Z)   (34)
           = I(X; Y) + I(X; Z|Y).   (35)

X, Z conditionally independent given Y =⇒ I(X; Z|Y) = 0; since I(X; Y|Z) ≥ 0, we have

I(X; Y) ≥ I(X; Z).

We have equality iff I(X; Y|Z) = 0, that is, X → Z → Y forms a Markov chain. Similarly, one can prove that I(Y; Z) ≥ I(X; Z).

Corollary 31. In particular, if Z = g(Y), we have I(X; Y) ≥ I(X; g(Y)).

Proof. X → Y → g(Y) forms a Markov chain.

Functions of the data Y cannot increase the information about X.

Corollary 32. If X → Y → Z, then I(X; Y|Z) ≤ I(X; Y).

Proof. In (34), (35) we have I(X; Z|Y) = 0 (by Markovity) and I(X; Z) ≥ 0. Thus

I(X; Y|Z) ≤ I(X; Y).   (36)

If X, Y, Z do not form a Markov chain, it is possible that I(X; Y|Z) > I(X; Y). For example, if X and Y are independent fair binary RVs and Z = X + Y, then I(X; Y) = 0 and I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = H(X|Z). But H(X|Z) = P(Z = 1) H(X|Z = 1) = 1/2 bit.
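The counterexample can be checked directly (a sketch for illustration only):

    from itertools import product
    from math import log2

    # joint pmf of (X, Y, Z) with X, Y independent fair bits and Z = X + Y
    p = {(x, y, x + y): 0.25 for x, y in product((0, 1), repeat=2)}

    def H(pmf):
        # entropy in bits of a pmf given as a dict of probabilities
        return -sum(v * log2(v) for v in pmf.values() if v > 0)

    def marginal(keep):
        m = {}
        for (x, y, z), v in p.items():
            key = tuple((x, y, z)[i] for i in keep)
            m[key] = m.get(key, 0.0) + v
        return m

    Hx, Hy, Hz = H(marginal((0,))), H(marginal((1,))), H(marginal((2,)))
    Hxy, Hxz, Hyz, Hxyz = H(marginal((0, 1))), H(marginal((0, 2))), H(marginal((1, 2))), H(p)

    print(Hx + Hy - Hxy)               # I(X;Y)   = 0
    print(Hxz + Hyz - Hz - Hxyz)       # I(X;Y|Z) = 0.5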

2.4 Sufficient statistics

Sufficient statistics

• We apply the data-processing inequality in statistics

• { f_θ(x)} a family of pmfs indexed by θ, X ∼ f_θ(x), T(X) a statistic

• θ → X → T(X); data-processing inequality (Theorem 30) implies

I(θ; T(X)) ≤ I(θ; X) (37)

with equality when no information is lost.

• A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.

Definition 33. A function T(X) is said to be a sufficient statistic relative to the family { f_θ(x)} if X is independent of θ given T(X) for any distribution on θ (i.e., θ → T(X) → X forms a Markov chain).


The definition is equivalent to the condition of equality in the data-processing inequality,

I(θ; T(X)) = I(θ; X)   (38)

for all distributions on θ. Hence sufficient statistics preserve mutual information and conversely.

Examples (sufficient statistics)

1. X_1, X_2, . . . , X_n, X_i ∈ {0, 1}, a sequence of i.i.d. Bernoulli variables with parameter θ = P(X_i = 1). Given n, the number of 1's is a sufficient statistic for θ:

T(X_1, . . . , X_n) = ∑_{i=1}^{n} X_i.

2. If X ∼ N(θ, 1), that is,

f_θ(x) = (1/√(2π)) e^{−(x−θ)²/2},

and X_1, X_2, . . . , X_n is a sample of i.i.d. N(θ, 1) RVs, then X̄_n = (1/n) ∑_{i=1}^{n} X_i is a sufficient statistic.

3. If f_θ(x) is the pdf of U(θ, θ + 1), a sufficient statistic for θ is

T(X_1, . . . , X_n) = (min_i X_i, max_i X_i).

Definition 34. A statistic T(X) is a minimal sufficient statistic relative to { f_θ(x)} if it is a function of every other sufficient statistic U:

θ → T(X) → U(X) → X.

Hence, a minimal sufficient statistic maximally compresses the information about θ in the sample.

2.5 Fano’s inequality

Fano’s inequality

• Suppose we wish to estimate X ∼ p(x)

• We observe Y, related to X by the conditional distribution p(y|x). From Y we calculate an estimate X̂ = g(Y) of X over the alphabet X.

• X → Y → X̂ forms a Markov chain

• Define the probability of error

P_e = P{X̂ ≠ X}.


Theorem 35. For any estimator X̂ such that X → Y → X̂, with P_e = P{X̂ ≠ X},

H(P_e) + P_e log |X| ≥ H(X|X̂) ≥ H(X|Y).   (39)

This inequality can be weakened to

1 + P_e log |X| ≥ H(X|Y)   (40)

or

P_e ≥ (H(X|Y) − 1) / log |X|.   (41)

Proof. For the first part we define the RV

E = 1 if X̂ ≠ X,  E = 0 if X̂ = X.

We expand H(E, X|X̂) in two ways using the chain rule:

H(E, X|X̂) = H(X|X̂) + H(E|X, X̂)   (42)
           = H(E|X̂) + H(X|E, X̂).   (43)

• Since E is a function of X and X̂, H(E|X, X̂) = 0.

• H(E|X̂) ≤ H(E) = H(P_e) (conditioning reduces entropy).

• Since E = 0 implies X = X̂, and for E = 1 the remaining uncertainty is at most the logarithm of the number of possible outcomes,

H(X|E, X̂) = P(E = 0) H(X|X̂, E = 0) + P(E = 1) H(X|X̂, E = 1) ≤ (1 − P_e) · 0 + P_e log |X|.   (44)

Combining these results, we obtain

H(P_e) + P_e log |X| ≥ H(X|X̂).

X → Y → X̂ a Markov chain =⇒ I(X; X̂) ≤ I(X; Y) =⇒ H(X|X̂) ≥ H(X|Y). Finally,

H(P_e) + P_e log |X| ≥ H(X|X̂) ≥ H(X|Y).


If we set X̂ = Y in Fano's inequality, we obtain

Corollary 36. For any two RVs X and Y, let p = P(X ≠ Y). Then

H(p) + p log |X| ≥ H(X|Y).   (45)

If the estimator g(Y) takes values in X, we can replace log |X| by log(|X| − 1).

Corollary 37. Let P_e = P(X̂ ≠ X), where X̂ : Y → X; then

H(P_e) + P_e log(|X| − 1) ≥ H(X|Y).

Proof. Like the proof of Theorem 35, except that in (44) the range of possible outcomes of X has cardinality |X| − 1.

Remark. Suppose there is no knowledge of Y. Thus, X must be guessed without any information. Let X ∈ {1, 2, . . . , m} and p_1 ≥ p_2 ≥ · · · ≥ p_m. Then the best guess of X is X̂ = 1 and the resulting probability of error is P_e = 1 − p_1. Fano's inequality becomes

H(P_e) + P_e log(m − 1) ≥ H(X).

The pmf

(p_1, p_2, . . . , p_m) = (1 − P_e, P_e/(m − 1), . . . , P_e/(m − 1))

achieves this bound with equality: Fano's inequality is sharp!
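A short Python sketch of this sharpness claim (illustrative; the values m = 5 and P_e = 0.3 are arbitrary, not from the notes):

    from math import log2

    def binary_entropy(p):
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    def entropy(pmf):
        return -sum(q * log2(q) for q in pmf if q > 0)

    m, Pe = 5, 0.3
    pmf = [1 - Pe] + [Pe / (m - 1)] * (m - 1)   # the pmf from the remark above

    lhs = binary_entropy(Pe) + Pe * log2(m - 1) # H(Pe) + Pe log(m - 1)
    rhs = entropy(pmf)                          # H(X)
    print(lhs, rhs)                             # equal: the bound holds with equality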

The next result relates probability of error and entropy. Let X and X′ be i.i.d. RVs with entropy H(X). Then

P(X = X′) = ∑_x p²(x).

Lemma 38. If X and X′ are i.i.d. RVs with entropy H(X),

P(X = X′) ≥ 2^{−H(X)},   (46)

with equality iff X has the uniform distribution.

Proof. Suppose X ∼ p(x); Jensen's inequality (applied to the convex function 2^y) implies

2^{E(log p(X))} ≤ E( 2^{log p(X)} ),

that is,

2^{−H(X)} = 2^{∑ p(x) log p(x)} ≤ ∑ p(x) 2^{log p(x)} = ∑ p²(x).

Corollary 39. Let X ∼ p(x), X′ ∼ r(x) be independent RVs over X. Then

P(X = X′) ≥ 2^{−H(p)−D(p‖r)},   (47)
P(X = X′) ≥ 2^{−H(r)−D(r‖p)}.   (48)


Proof.

2^{−H(p)−D(p‖r)} = 2^{∑ p(x) log p(x) + ∑ p(x) log [r(x)/p(x)]} = 2^{∑ p(x) log r(x)}.

From Jensen's inequality and the convexity of f(y) = 2^y it follows that

2^{−H(p)−D(p‖r)} ≤ ∑ p(x) 2^{log r(x)} = ∑ p(x) r(x) = P(X = X′).

References

[1] Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, 2nd edition, Wiley, 2006.

[2] David J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.

[3] Robert M. Gray, Entropy and Information Theory, Springer, 2009.
