2 Entropy and Mutual Information
From the examples on information given earlier, it seems clear that
the information, i(A), provided by the occurrence of an event A should
have the following properties:
1. i(A) must be a monotonically decreasing function of its probability:
i(A) = f[p(A)],
where f(·) is monotonically decreasing in p(A).
2. i(A) ≥ 0 for 0 ≤ p(A) ≤ 1.
3. i(A) = 0 if p(A) = 1.
4. i(A) > i(B) if p(A) < p(B).
5. If A and B are independent events, then i(A,B) = i(A ∩ B) = i(A) + i(B).
It turns out there is one and only one function that satisfies the
above requirements, namely the logarithmic function.
Definition 1 Let A be an event having probability of occurrence p(A).
Then the amount of information conveyed by the knowledge of the oc-
currence of A, referred to as the self-information of event A, is given
by
i(A) = −log2 p(A) bits.
The unit of information is bits when the base of the logarithm is 2, nats when the base is e ≈ 2.718, and hartleys when a base-10 logarithm is used.
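As a quick numerical check of Definition 1, the sketch below (plain Python; the function name is ours) evaluates −log₂ p for a few events:

```python
import math

def self_information(p, base=2):
    """Self-information i(A) = -log_base p(A) of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p, base)

print(self_information(0.5))   # a fair-coin outcome carries 1 bit
print(self_information(1.0))   # a certain event carries no information
print(self_information(0.25))  # a rarer event carries more: 2 bits
```

Note how the properties listed above hold: the value is non-negative, zero for a sure event, and larger for less likely events.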
Definition 2 (Mutual information) The amount of information provided by the occurrence of an event B about the occurrence of an event A, known as the mutual information between A and B, is defined by¹
i(A;B) = log [p(A|B)/p(A)] = −log p(A) − [−log p(A|B)] = i(A) − i(A|B).
In other words, i(A;B) is also the amount of uncertainty about event
A removed by the occurrence of event B.
2.1 Properties of Mutual Information
1.
i(A;B) = i(B;A).
Proof:
i(A;B) = log p(A|B)/p(A)
= log p(A,B)/[p(A)p(B)]
= log p(B|A)/p(B)
= i(B;A).
¹Here, we abuse notation somewhat by using i(·) to denote self-information and i(·;·) to denote mutual information.
2. Mutual information can be negative, zero, or positive:
i(A;B) < 0 if p(A|B) < p(A),
i(A;B) > 0 if p(A|B) > p(A),
i(A;B) = 0 if p(A|B) = p(A).
3. i(A;A) = −log p(A) = i(A).
Definition 3 The mutual information between events A and B given
that event C has occurred is
i(A;B|C) = log p(A|B,C)/p(A|C) = log p(A,B|C)/[p(A|C)p(B|C)] = i(B;A|C).
Exercise 1 Prove the following chain rule for mutual information:
i(A;B,C) = i(A;B) + i(A;C|B).
Proof: We have
i(A;B,C) = log p(A|B,C)/p(A).
Now,
p(A,B,C) = p(A|B,C)p(C|B)p(B) = p(C|A,B)p(A|B)p(B),
so that
p(A|B,C) = p(C|A,B)p(A|B)/p(C|B).
Thus,
i(A;B,C) = log { [p(A|B)/p(A)] · [p(C|A,B)/p(C|B)] }
= log p(A|B)/p(A) + log p(C|A,B)/p(C|B)
= i(A;B) + i(C;A|B)
= i(A;B) + i(A;C|B).
Thus, the amount of information about event A provided by the joint
occurrence of events B and C is equal to the amount of information
provided about A by the occurrence of B plus the amount of informa-
tion provided about A by the occurrence of C given that B has already
occurred.
Example 1 Consider flipping a fair coin and providing information as to the outcome through the following so-called Z-channel. We have the following mapping from the coin outcome (H or T) to the real field: H → 0 and T → 1. Let X be the binary-valued random variable associated with the coin flipping.
[Figure 1 omitted: the Z-channel. Input X ∈ {0, 1} with p(X = 0) = p(X = 1) = 0.5; X = 0 is received as Y = 0 with probability 1, while X = 1 is received as Y = 0 with probability ε and as Y = 1 with probability 1 − ε.]
Figure 1: The Z-channel for Example 1.
1. What is the self-information of the events X = 0 and X = 1?
We have²
i(X=0) = −log p(X=0) = −log(1/2) = 1 bit,
i(X=1) = −log p(X=1) = −log(1/2) = 1 bit.
2. What is the self-information of events Y = 0, Y = 1?
We have:
i(Y=0) = −log p(Y=0).
p(Y=0) = p(Y=0|X=0)p(X=0) + p(Y=0|X=1)p(X=1) = 1·p(X=0) + ε·p(X=1) = (1+ε)/2.
i(Y=0) = −log [(1+ε)/2] = 1 − log(1+ε) bits.
Similarly,
i(Y=1) = −log p(Y=1).
p(Y=1) = 1 − p(Y=0) = (1−ε)/2.
i(Y=1) = −log [(1−ε)/2] = 1 − log(1−ε) bits.
3. What is the mutual information between events X = 0 and Y = 0?
We have
i(Y=0; X=0) = log p(Y=0|X=0)/p(Y=0) = log 2/(1+ε) = 1 − log(1+ε).
²Unless otherwise stated, log will denote the base-2 logarithm.
4. What is the mutual information between events X = 0 and Y = 1?
i(Y=1; X=0) = log p(Y=1|X=0)/p(Y=1) = log(0) = −∞.
5. What is the mutual information between events X = 1 and Y = 0?
i(Y=0; X=1) = log p(Y=0|X=1)/p(Y=0) = log 2ε/(1+ε).
6. What is the mutual information between events X = 1 and Y = 1?
i(Y=1; X=1) = log p(Y=1|X=1)/p(Y=1) = log [2(1−ε)/(1−ε)] = 1 bit.
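These event-level quantities are easy to verify numerically. The sketch below recomputes the three mutual-information values for the Z-channel; the value ε = 0.25 is an arbitrary choice of ours, used only for illustration:

```python
import math

eps = 0.25  # assumed crossover probability, chosen only for illustration
log2 = math.log2

# Channel: p(Y=0|X=0) = 1, p(Y=0|X=1) = eps, with p(X=0) = p(X=1) = 1/2.
pY0 = 0.5 * 1 + 0.5 * eps        # = (1 + eps)/2
pY1 = 1 - pY0                    # = (1 - eps)/2

i_X0_Y0 = log2(1 / pY0)          # = 1 - log2(1 + eps)
i_X1_Y0 = log2(eps / pY0)        # = log2(2*eps/(1 + eps)), negative for small eps
i_X1_Y1 = log2((1 - eps) / pY1)  # = 1 bit, for every eps < 1
print(i_X0_Y0, i_X1_Y0, i_X1_Y1)
```

Observe that observing Y = 0 when X = 1 actually yields negative mutual information: the output makes the true input look less likely than it was a priori.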
2.2 Average Self-Information: Entropy
Definition 4 (Entropy) The entropy of a source represented by a random variable X with realizations taking values from the set 𝒳 is
H(X) = −Σ_{x∈𝒳} p(x) log p(x).
From the above definition, we can write
H(X) = E[−log p(X)] = E[i(X)],
which implies that the entropy of a source is the average amount of
information produced by the source.
Clearly, since −log p(x) ≥ 0, we have H(X) ≥ 0.
Example 2 Consider a source X that produces two symbols with equal
probability (1/2). The entropy of this source is
H(X) = −(1/2) log(1/2) − (1/2) log(1/2) = 1 bit.
Example 3 Consider the tossing of a fair die (each outcome occurs
with equal probability 1/6). The average amount of information pro-
duced by this source is
H(X) = −(1/6) log(1/6) × 6 = log 6 ≈ 2.585 bits.
Note that the source in Example 3 produces more information on
average than the source in Example 2.
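The two entropy computations above can be reproduced in a few lines; the sketch below uses a helper function of our own naming:

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_x p(x) log p(x); zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # fair coin (Example 2): 1 bit
print(entropy([1/6] * 6))    # fair die (Example 3): log2(6), about 2.585 bits
```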
Definition 5 The joint entropy, H(X,Y ), of two discrete random vari-
ables X and Y is
H(X,Y) = −E_{X,Y}[log p(X,Y)] = −Σ_x Σ_y p(x,y) log p(x,y).
Definition 6 The conditional entropy of Y given X is
H(Y|X) = −E_{X,Y}[log p(Y|X)] = −Σ_x Σ_y p(x,y) log p(y|x).
Theorem 1 (Chain rule for entropy)
H(X, Y ) = H(X) +H(Y |X).
Proof: We have
−log p(X,Y) = −log [p(Y|X)p(X)] = −log p(X) − log p(Y|X).
Taking expectations on both sides above
E_{X,Y}[−log p(X,Y)] = E_{X,Y}[−log p(X)] + E_{X,Y}[−log p(Y|X)] ⟹ H(X,Y) = H(X) + H(Y|X).
The above result can be easily generalized to the entropy of a random vector X = (X1, X2, …, XN). Let X^n = (X1, X2, …, Xn). Then the entropy of X can be expressed as
H(X) = Σ_{n=1}^N H(Xn | X^{n−1}),
where we let H(X1|X^0) ≜ H(X1).
Proof: We can write
p(X) = Π_{n=1}^N p(Xn | X^{n−1}),
where we let p(X1|X^0) ≜ p(X1). Thus,
−log p(X) = −Σ_{n=1}^N log p(Xn | X^{n−1}).
Taking expectations with respect to the joint probability mass function
p(X) on both sides of the equation, we obtain the desired relation.
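The chain rule is easy to check numerically. In the sketch below, the joint pmf is an arbitrary example of ours, not one taken from the text:

```python
import math

# An arbitrary joint pmf p(x, y) over {0,1} x {0,1}, used purely for illustration.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

H_XY = -sum(pxy * math.log2(pxy) for pxy in p.values())
pX = {0: p[(0, 0)] + p[(0, 1)], 1: p[(1, 0)] + p[(1, 1)]}
H_X = -sum(px * math.log2(px) for px in pX.values())
# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x)
H_Y_given_X = -sum(pxy * math.log2(pxy / pX[x]) for (x, y), pxy in p.items())

print(H_XY, H_X + H_Y_given_X)   # the two values agree: H(X,Y) = H(X) + H(Y|X)
```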
2.3 Relative Entropy and Mutual Information
Definition 7 The relative entropy or Kullback-Leibler distance between
two probability mass functions p(x) and q(x) is given by
D(p‖q) = E_p[log p(x)/q(x)] = Σ_x p(x) log [p(x)/q(x)].
The relative entropy is a measure of the distance between the two distributions p(x) and q(x), even though it is not a true distance metric (it is not symmetric, for example).
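A direct computation illustrates both points at once, D(p‖q) ≥ 0 and, in general, D(p‖q) ≠ D(q‖p); the helper below is our own sketch:

```python
import math

def kl_divergence(p, q, base=2):
    """D(p||q) = sum_x p(x) log [p(x)/q(x)]; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))   # positive
print(kl_divergence(q, p))   # positive but different: D is not symmetric
print(kl_divergence(p, p))   # 0: a pmf is at zero "distance" from itself
```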
We defined earlier the mutual information between two events. Now consider two random variables X ∈ 𝒳 and Y ∈ 𝒴. If x ∈ 𝒳 is a realization of X and y ∈ 𝒴 is a realization of Y, the mutual information between x and y is
i(x; y) = log p(x|y)/p(x),
which is obviously a random variable over the ensemble of realizations
of X and Y . We have the following definition.
Definition 8 The mutual information between two discrete random
variables X and Y is
I(X;Y) = E[log p(X|Y)/p(X)]
= Σ_x Σ_y p(x,y) log [p(x|y)/p(x)]
= Σ_x Σ_y p(x,y) log [p(x,y)/(p(x)p(y))]
= D(p(x,y) ‖ p(x)p(y))
= Σ_x Σ_y p(x,y) log [p(y|x)/p(y)]
= I(Y;X).
The mutual information between random variables X and Y is the
average amount of information provided about X by observing Y, which
is also the average amount of uncertainty resolved about X by observ-
ing Y . As can be seen, I(X;Y ) = I(Y ;X), i.e. Y resolves as much
uncertainty about X as X about Y .
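Definition 8 translates directly into code. The sketch below (our own helper) computes I(X;Y) as D(p(x,y)‖p(x)p(y)) from a joint pmf stored as a dictionary:

```python
import math

def mutual_information(joint, base=2):
    """I(X;Y) = D(p(x,y) || p(x)p(y)) for a joint pmf {(x, y): prob}."""
    pX, pY = {}, {}
    for (x, y), pxy in joint.items():
        pX[x] = pX.get(x, 0.0) + pxy
        pY[y] = pY.get(y, 0.0) + pxy
    return sum(pxy * math.log(pxy / (pX[x] * pY[y]), base)
               for (x, y), pxy in joint.items() if pxy > 0)

# Independent X and Y: I(X;Y) = 0.  Fully dependent (Y = X): I(X;Y) = H(X) = 1 bit.
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}))
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))
```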
We have the following relations between entropy and mutual infor-
mation:
1.
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).
Proof: We have
I(X;Y) = E[log p(X|Y)/p(X)]
= E[−log p(X)] − E[−log p(X|Y)]
= H(X) − H(X|Y).
Since, as established earlier, I(X;Y) = I(Y;X), we have
I(X;Y ) = H(Y )H(Y |X).
2.
I(X;Y) = H(X) − H(X|Y) = H(X) − [H(X,Y) − H(Y)] = H(X) + H(Y) − H(X,Y).
3. I(X;X) = H(X) − H(X|X) = H(X).
The diagram in Figure 2 summarizes the relationship between the
various quantities:
Figure 2: Relationship between entropy, conditional entropy, joint entropy, and mutual information.
Definition 9 The conditional mutual information of discrete random
variables X and Y given Z is
I(X;Y|Z) = E_{XYZ}[log p(X|Y,Z)/p(X|Z)] = H(X|Z) − H(X|Y,Z).
Theorem 2 (Chain rule for mutual information):
Let X = (X1, X2, …, XN) be a random vector. Then the mutual information between X and Y is
I(X;Y) = Σ_{n=1}^N I(Xn; Y | X^{n−1}).
Proof:
I(X;Y) = H(X) − H(X|Y)
= Σ_{n=1}^N H(Xn|X^{n−1}) − Σ_{n=1}^N H(Xn|X^{n−1}, Y)
= Σ_{n=1}^N [H(Xn|X^{n−1}) − H(Xn|X^{n−1}, Y)]
= Σ_{n=1}^N I(Xn; Y | X^{n−1}).
2.4 Jensen's Inequality
Definition 10 A function f(x) is convex over an interval (a, b) if for all x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1
f(λx1 + (1−λ)x2) ≤ λf(x1) + (1−λ)f(x2).
The function is said to be strictly convex if equality above holds only if λ = 0 or λ = 1.
Definition 11 A function f(x) is said to be concave if −f(x) is convex.
Theorem 3 A function f that has a non-negative (positive) second
derivative is convex (strictly convex).
Theorem 4 (Jensen's inequality) Let f be a convex function and X a random variable. Then
E[f(X)] ≥ f(E[X]).
If f is strictly convex, equality holds if and only if X is a constant (not random).
Proof: (by induction)
Let L be the number of values the discrete random variable X can take.
For a binary random variable, i.e. L = 2, with probabilities p1 and p2, the inequality becomes
p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2),
and it is true in view of the definition of a convex function.
If f is strictly convex, equality holds iff p1 or p2 is zero, i.e. if X is deterministic.
Now let us assume the inequality holds for L = k − 1. We need to show that it then holds for L = k. Let p_i, i = 1, 2, …, k, be the probabilities for a k-valued random variable and define p′_i = p_i/(1 − p_k), i = 1, 2, …, k−1. Clearly Σ_i p′_i = 1 and the p′_i are thus the probabilities of some (k−1)-valued random variable. We have
Σ_{i=1}^k p_i f(x_i) = p_k f(x_k) + (1 − p_k) Σ_{i=1}^{k−1} p′_i f(x_i)
≥ p_k f(x_k) + (1 − p_k) f(Σ_{i=1}^{k−1} p′_i x_i)
≥ f(p_k x_k + (1 − p_k) Σ_{i=1}^{k−1} p′_i x_i)
= f(Σ_{i=1}^k p_i x_i),
where the first inequality uses the induction hypothesis and the second the definition of convexity.
For the equality proof, note that equality in the first inequality above holds when all the p′_i but one are zero, and in the last inequality when p_k or 1 − p_k is zero, i.e. when X is deterministic.
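For the convex function f(x) = x², Jensen's inequality reads E[X²] ≥ (E[X])², a relation we can spot-check on random data (the sample below is an arbitrary illustration of ours):

```python
import random

random.seed(0)  # arbitrary seed so the run is reproducible
xs = [random.uniform(-1.0, 2.0) for _ in range(10_000)]

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)
print(mean_sq >= mean * mean)   # True: E[f(X)] >= f(E[X]) for convex f(x) = x^2
```

Here the gap E[X²] − (E[X])² is exactly the variance of the sample, which is why the inequality can only be tight for a constant X.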
Theorem 5
D(p‖q) ≥ 0.
Equality above holds if and only if p(x) = q(x).
Proof:
D(p‖q) = E_p[log p(x)/q(x)]
= E_p[−log q(x)/p(x)]
≥ −log E_p[q(x)/p(x)]
= −log Σ_x q(x) = 0.
Since −log is strictly convex, equality above holds if and only if p(x)/q(x) = c, c a constant, i.e. p(x) = cq(x). Summing over x on both sides, c = 1, i.e. p(x) = q(x).
Corollary 1
I(X;Y) ≥ 0.
Proof:
I(X;Y) = D(p(x,y) ‖ p(x)p(y)) ≥ 0,
with equality iff p(x,y) = p(x)p(y), i.e. X and Y are independent.
Theorem 6 Let 𝒳 be the range of random variable X and |𝒳| the cardinality of 𝒳. Then
H(X) ≤ log |𝒳|,
with equality iff X is uniformly distributed over 𝒳.
Proof: Let u(x) = 1/|𝒳|, x ∈ 𝒳, and let p(x) be the probability mass function of X. We have
D(p‖u) = Σ_x p(x) log [p(x)/u(x)] = log |𝒳| − H(X) ≥ 0.
Theorem 7 Conditioning reduces entropy:
H(X) ≥ H(X|Y),
with equality iff X and Y are independent.
Proof:
I(X;Y) = H(X) − H(X|Y) ≥ 0 ⟹ H(X) ≥ H(X|Y),
with equality iff X and Y are independent.
Theorem 8
H(X1, X2, …, Xn) ≤ Σ_{i=1}^n H(Xi),
with equality iff the Xi are independent.
Proof: From the chain rule of entropy,
H(X1, X2, …, Xn) = Σ_{i=1}^n H(Xi | X1, X2, …, X_{i−1}) ≤ Σ_{i=1}^n H(Xi),
with equality iff the Xi are independent, by Theorem 7.
Theorem 9 (The log-sum inequality) For non-negative numbers a_i, i = 1, 2, …, n, and b_i, i = 1, 2, …, n,
Σ_{i=1}^n a_i log (a_i/b_i) ≥ (Σ_{i=1}^n a_i) log [(Σ_{i=1}^n a_i)/(Σ_{i=1}^n b_i)].
Proof: Let A = Σ_{i=1}^n a_i, B = Σ_{i=1}^n b_i, a′_i = a_i/A and b′_i = b_i/B. Then
Σ_{i=1}^n a_i log (a_i/b_i) = A Σ_{i=1}^n a′_i log [a′_i A/(b′_i B)]
= A Σ_{i=1}^n a′_i log (a′_i/b′_i) + A log (A/B)
= A·D(a′‖b′) + A log (A/B)
≥ A log (A/B).
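The log-sum inequality can be spot-checked on arbitrary non-negative numbers; the values below are our own:

```python
import math

a = [1.0, 2.0, 3.0]    # arbitrary non-negative numbers for illustration
b = [0.5, 1.0, 4.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs >= rhs)   # True, as the inequality guarantees
```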
Theorem 10 D(p‖q) is a convex function of the pair of probability mass functions p and q. In other words, if (p1, q1) and (p2, q2) are two pairs of probability mass functions, then
λD(p1‖q1) + (1−λ)D(p2‖q2) ≥ D(λp1 + (1−λ)p2 ‖ λq1 + (1−λ)q2).
Proof: By the log-sum inequality,
λp1(x) log [λp1(x)/(λq1(x))] + (1−λ)p2(x) log [(1−λ)p2(x)/((1−λ)q2(x))]
≥ [λp1(x) + (1−λ)p2(x)] log { [λp1(x) + (1−λ)p2(x)] / [λq1(x) + (1−λ)q2(x)] }.
Summing over all x yields the desired property.
Theorem 11 (Concavity of entropy): H(p) is a concave function of p.
Proof: We can write
H(p) = log |𝒳| − D(p‖u).
Since the relative entropy is convex (by the previous theorem), it follows
that the entropy is concave.
An interesting alternative proof is as follows: Let X1 be a random variable with distribution p1 taking values from a set A and X2 another random variable with distribution p2 taking values from the same set A. Moreover, let θ be a binary random variable with
θ = 1 with probability λ,
θ = 2 with probability 1 − λ.
Now let Z = X_θ. Then Z takes values from A with probability distribution
p(Z) = λp1 + (1−λ)p2.
Thus,
H(Z) = H(λp1 + (1−λ)p2).
On the other hand,
H(Z|θ) = λH(p1) + (1−λ)H(p2).
Since conditioning reduces entropy, we have
H(Z) ≥ H(Z|θ) ⟹ H(λp1 + (1−λ)p2) ≥ λH(p1) + (1−λ)H(p2),
which proves H(p) is concave in p.
Exercise 2 Consider two containers C1 and C2 (shown below) con-
taining n1 and n2 molecules, respectively. The energies of the n1 mole-
cules in C1 are i.i.d. random variables with common distribution p1.
Similarly, the energies of the molecules in C2 are i.i.d. with common
distribution p2.
1. Find the entropies of the ensembles in C1 and C2.
2. Now assume that the separation between the two containers is removed. Find the entropy of the mixture and show it is greater than the sum of the individual entropies in part 1.
Solution: Let Xi, i = 1, 2, …, n1, be the i.i.d. random variables associated with the energies in container C1 and Yi, i = 1, 2, …, n2, those for container C2. Then,
1.
H(X1, X2, …, X_{n1}) = Σ_{i=1}^{n1} H(Xi) = n1 H(p1).
Similarly,
H(Y1, Y2, …, Y_{n2}) = Σ_{i=1}^{n2} H(Yi) = n2 H(p2).
2. After the separation is removed, let Zi, i = 1, 2, …, n1 + n2, be the random energies associated with the molecules in the mixture. The n1 + n2 random variables are still i.i.d. with a common distribution given by
p(Zi) = [n1/(n1+n2)] p1 + [n2/(n1+n2)] p2.
Thus,
H(Z1, Z2, …, Z_{n1+n2}) = (n1+n2) H( [n1/(n1+n2)] p1 + [n2/(n1+n2)] p2 )
≥ (n1+n2) { [n1/(n1+n2)] H(p1) + [n2/(n1+n2)] H(p2) }
= n1 H(p1) + n2 H(p2),
where the inequality above is due to the concavity of the entropy.
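The mixing argument can be checked numerically. In the sketch below, the molecule counts and the two energy distributions are arbitrary choices of ours:

```python
import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

n1, n2 = 100, 300                     # assumed molecule counts
p1, p2 = [0.9, 0.1], [0.2, 0.8]       # assumed per-molecule energy pmfs
lam = n1 / (n1 + n2)
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

separate = n1 * entropy(p1) + n2 * entropy(p2)
mixed = (n1 + n2) * entropy(mix)
print(mixed >= separate)   # True: mixing cannot decrease the total entropy
```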
Theorem 12 Let (X, Y) have joint distribution p(x, y) = p(x)p(y|x). The mutual information I(X;Y) is a concave function of p(x) for fixed
p(y|x) and a convex function of p(y|x) for a fixed p(x).
2.5 The Data Processing Theorem
Definition 12 Three random variables X, Y, and Z form a Markov chain in that order, denoted X → Y → Z, if, conditioned on Y, Z is independent of X. In this case
p(x, y, z) = p(x)p(y|x)p(z|x, y) = p(x)p(y|x)p(z|y)
or
p(z|x, y) = p(z|y) ⟺ p(z, x|y) = p(z|y)p(x|y).
Theorem 13 (Data Processing theorem): Consider the Markov chain X → Y → Z. Then
I(X;Y) ≥ I(X;Z).
Proof: We have
I(X; Y, Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z).
However, I(X;Z|Y) = 0 since X and Z are independent given Y. Thus,
I(X;Y) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z).
Equality above holds when I(X;Y|Z) = 0, i.e. when X and Y are independent given Z, or, in other words, when we also have the Markov chain X → Z → Y.
Corollary 2 If Z = g(Y), then I(X;Y) ≥ I(X; g(Y)).
Proof: X → Y → g(Y) is a Markov chain.
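The data processing theorem can be illustrated by cascading two binary symmetric channels; the crossover probabilities below are assumptions of ours:

```python
import math

def mi(joint):
    """Mutual information (bits) of a joint pmf given as {(a, b): prob}."""
    pA, pB = {}, {}
    for (a, b), p in joint.items():
        pA[a] = pA.get(a, 0.0) + p
        pB[b] = pB.get(b, 0.0) + p
    return sum(p * math.log2(p / (pA[a] * pB[b]))
               for (a, b), p in joint.items() if p > 0)

# Markov chain X -> Y -> Z: uniform binary X through two cascaded BSCs.
e1, e2 = 0.1, 0.2   # assumed crossover probabilities
pXY = {(x, y): 0.5 * ((1 - e1) if x == y else e1) for x in (0, 1) for y in (0, 1)}
pXZ = {(x, z): sum(pXY[(x, y)] * ((1 - e2) if y == z else e2) for y in (0, 1))
       for x in (0, 1) for z in (0, 1)}

print(mi(pXY) >= mi(pXZ))   # True: processing Y into Z cannot gain information
```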
2.6 Fano's Inequality
Consider the simple communication system depicted in Figure 3. We
are interested in estimating X from the received data Y . Towards this
end, Y is processed by some function g to obtain an estimate X̂ of X. In digital communications we are interested in estimating X in as error-free a fashion as possible. Thus, we are interested in minimizing the probability of error,
Pe = Pr[X̂ ≠ X].
Fano's inequality relates the probability of error to the conditional entropy H(X|Y).
[Figure 3 omitted: X passes through the channel P(y|x) to produce Y, which is processed by g(·) to yield the estimate X̂.]
Figure 3: Figure for Fano's Inequality.
Theorem 14 (Fano's Inequality):
H(Pe) + Pe log(|𝒳| − 1) ≥ H(X|Y).
A somewhat weaker inequality is
1 + Pe log |𝒳| ≥ H(X|Y)
or
Pe ≥ [H(X|Y) − 1] / log |𝒳|.
Proof:
Let
E = 1 if X̂ ≠ X and E = 0 if X̂ = X.
Clearly, p(E = 1) = Pe and p(E = 0) = 1 − Pe. Then
H(E, X|Y) = H(X|Y) + H(E|X, Y) = H(E|Y) + H(X|E, Y).
Since Y determines X̂ with no uncertainty, and X together with X̂ determines E, we have H(E|X, Y) = 0. Thus,
H(X|Y) = H(E|Y) + H(X|E, Y) ≤ H(Pe) + H(X|E, Y). (1)
Now,
H(X|E, Y) = Pr(E = 0) H(X|E = 0, Y) + Pr(E = 1) H(X|E = 1, Y)
= Pr(E = 1) H(X|E = 1, Y) ≤ Pe log(|𝒳| − 1), (2)
since H(X|E = 0, Y) = 0.
Combining (1) and (2) we obtain the desired bound.
The weaker bound is obtained easily by bounding H(Pe) from above by 1 and replacing |𝒳| − 1 by |𝒳|.
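As a simple check of the bound, take uniform binary X sent over a BSC with crossover probability p and estimate X̂ = Y. Then Pe = p and H(X|Y) = h_b(p), so with |𝒳| = 2 Fano's inequality holds with equality. A sketch (the helper name is ours):

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Uniform binary X over a BSC(p), estimator Xhat = Y: Pe = p, H(X|Y) = hb(p).
# Fano: H(Pe) + Pe*log(|X| - 1) >= H(X|Y); with |X| = 2, log(|X| - 1) = 0.
p = 0.1   # assumed crossover probability
lhs = hb(p) + p * math.log2(2 - 1)
print(lhs, hb(p))   # equal: the bound is tight in this case
```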
2.7 Exercises for Chapter 2
Exercise 3 Consider the Z-channel discussed in Example 1. Com-
pute and plot the mutual information between the input and output of
the channel.
Solution: We have
I(X;Y) = Σ_x Σ_y p(x, y) log [p(x|y)/p(x)]
= Σ_x Σ_y p(y|x) p(x) log [p(y|x)/p(y)]
= 1 − [(1+ε)/2] log(1+ε) + (ε/2) log ε.
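The closed-form expression can be cross-checked against a direct evaluation of the double sum; ε = 0.25 below is an arbitrary choice of ours:

```python
import math

eps = 0.25  # assumed crossover probability, for illustration only
log2 = math.log2

# Z-channel joint pmf with uniform input: p(x, y) = p(y|x) * 1/2.
joint = {(0, 0): 0.5, (1, 0): 0.5 * eps, (1, 1): 0.5 * (1 - eps)}
pY = {0: 0.5 * (1 + eps), 1: 0.5 * (1 - eps)}

direct = sum(p * log2(p / (0.5 * pY[y])) for (x, y), p in joint.items() if p > 0)
closed = 1 - ((1 + eps) / 2) * log2(1 + eps) + (eps / 2) * log2(eps)
print(abs(direct - closed) < 1e-12)   # True: the two computations agree
```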
Exercise 4 The channel in Figure 5 below is known as the binary sym-
metric channel (BSC) and it is another simple model for a channel.
[Figure 4 omitted: plot of I(X;Y) versus ε.]
Figure 4: Mutual information between input and output for the Z-channel.
[Figure 5 omitted: the binary symmetric channel. Input X ∈ {0, 1} with p(X = 0) = p(X = 1) = 0.5; each input is received correctly with probability 1 − p and flipped with probability p.]
Figure 5: The binary symmetric channel.
Compute the mutual information between the input X and the output
Y as a function of the cross-over probability p.
We have
I(X;Y) = Σ_x Σ_y p(y|x) p(x) log [p(y|x)/p(y)].
Now,
p(Y = 0) = (1/2)(1 − p) + (1/2)p = 1/2,
p(Y = 1) = 1/2.
Thus,
I(X;Y) = (1 − p) log[2(1 − p)] + p log(2p) = 1 + p log p + (1 − p) log(1 − p) = 1 − h_b(p),
where
h_b(p) = −p log p − (1 − p) log(1 − p)
is the binary entropy function.
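A short sketch evaluating 1 − h_b(p) at a few crossover probabilities (the helper function is our own):

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5):
    print(p, 1 - hb(p))   # 1 bit at p = 0, 0 bits at p = 1/2
```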
A plot of the mutual information as a function of p is given below:
[Figure 6 omitted: plot of 1 − h_b(p) versus p.]
Figure 6: The mutual information between input and output for a BSC.
Exercise 5 Show that if X is a function of Y then H(Y) ≥ H(X).
Solution: By the corollary to the data processing theorem,
I(Y;Y) ≥ I(Y;X) ⟹ H(Y) ≥ H(X) − H(X|Y) = H(X),
since H(X|Y) = 0 when X is a function of Y.
Another solution:
H(Y) = H(X, Y) − H(X|Y) = H(X) + H(Y|X) ≥ H(X),
where H(X|Y) = 0 because X is a function of Y.
Exercise 6 Consider the simple rate 2/3 parity-check code where the
third bit, known as the parity bit, is the exclusive or of the first two bits:
000
011
101
110
Let the first two bits (the information bits) be defined by the random vector X = (X1, X2) and the parity bit by the random variable Y.
1. How much uncertainty is resolved about what X is by observing
Y ?
2. How much uncertainty about X is resolved by observing Y and
X2?
3. Now suppose the parity bit Y is observed through a BSC with a cross-over probability ε that produces an output Z. How much uncertainty is resolved about X by observing Z?
Solution:
1. We need to compute I(X;Y). We have
I(X;Y) = H(X) − H(X|Y).
Clearly,
H(X) = 2 bits.
Now,
H(X|Y) = H(X1|Y) + H(X2|X1, Y)
= Pr(Y = 0) H(X1|Y = 0) + Pr(Y = 1) H(X1|Y = 1)
= (1/2) H(X1|Y = 0) + (1/2) H(X1|Y = 1)
= H(X1|Y = 0) = 1 bit,
since H(X2|X1, Y) = 0 (X1 and the parity bit determine X2).
Thus,
I(X;Y) = 2 − 1 = 1 bit.
2.
I(X; Y, X2) = H(X) − H(X|Y, X2) = 2 bits,
since H(X|Y, X2) = 0 (Y and X2 determine X1).
3. We have
I(X; Z, Y) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y),
where I(X;Z|Y) = 0 since X → Y → Z is a Markov chain. Thus,
I(X;Z) = I(X;Y) − I(X;Y|Z).
Now,
I(X;Y|Z) = (1/2) I(X;Y|Z = 0) + (1/2) I(X;Y|Z = 1) = I(X;Y|Z = 0),
since p(Z = 0) = p(Z = 1) = 1/2 and, due to symmetry, it can be argued that I(X;Y|Z = 0) = I(X;Y|Z = 1) (verify this). Then,
I(X;Y|Z = 0) = Σ_x Σ_{y=0}^1 p(X = x, Y = y|Z = 0) log [p(X = x|Y = y, Z = 0)/p(X = x|Z = 0)]
= Σ_x Σ_{y=0}^1 p(Y = y|X = x, Z = 0) p(X = x|Z = 0) log [p(X = x|Y = y)/p(X = x|Z = 0)]
= Σ_x Σ_{y=0}^1 p(Y = y|X = x) p(X = x|Z = 0) log [p(Y = y|X = x) p(X = x) / (p(X = x|Z = 0) p(Y = y))]
= −Σ_x p(X = x|Z = 0) log [2 p(X = x|Z = 0)]
= h_b(ε).
Thus,
I(X;Z) = 1 − h_b(ε).