2 Entropy and Mutual Information
From the examples on information given earlier, it seems clear that
the information, i(A), provided by the occurrence of an event A should
have the following properties:
1. i(A) must be a monotonically decreasing function of its probability:
i(A) = f[p(A)],
where f(·) is monotonically decreasing in p(A).
2. i(A) ≥ 0 for 0 ≤ p(A) ≤ 1.
3. i(A) = 0 if p(A) = 1.
4. i(A) > i(B) if p(A) < p(B).
5. If A and B are independent events, then i(A,B) = i(A ∩ B) = i(A) + i(B).
It turns out there is one and only one function that satisfies the
above requirements, namely the logarithmic function.
Definition 1 Let A be an event having probability of occurrence p(A).
Then the amount of information conveyed by the knowledge of the oc-
currence of A, referred to as the self-information of event A, is given
by
i(A) = −log2 p(A) bits.
The unit of information is bits when the base of the logarithm is 2, nats when the base is e ≈ 2.718, and hartleys when a base-10 logarithm is used.
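As a quick numerical check of Definition 1, the sketch below (plain Python; the function name is ours) evaluates −log₂ p for a few events:

```python
import math

def self_information(p, base=2):
    """Self-information i(A) = -log_base p(A) of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p, base)

print(self_information(0.5))   # a fair-coin outcome carries 1 bit
print(self_information(1.0))   # a certain event carries no information
print(self_information(0.25))  # a rarer event carries more: 2 bits
```

Note how the properties listed above hold: the value is non-negative, zero for a sure event, and larger for less likely events.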
Definition 2 (Mutual information) The amount of information provided by the occurrence of an event B about the occurrence of an event A, known as the mutual information between A and B, is defined by¹
i(A;B) = log [p(A|B)/p(A)] = −log p(A) − [−log p(A|B)] = i(A) − i(A|B).
In other words, i(A;B) is also the amount of uncertainty about event
A removed by the occurrence of event B.
2.1 Properties of Mutual Information
1.
i(A;B) = i(B;A).
Proof:
i(A;B) = log p(A|B)/p(A)
= log p(A,B)/[p(A)p(B)]
= log p(B|A)/p(B)
= i(B;A).
¹Here, we abuse notation somewhat by using i(·) to denote self-information and i(·;·) to denote mutual information.
2. Mutual information can be negative, zero, or positive:
i(A;B) < 0 if p(A|B) < p(A),
i(A;B) > 0 if p(A|B) > p(A),
i(A;B) = 0 if p(A|B) = p(A).
3. i(A;A) = −log p(A) = i(A).
Definition 3 The mutual information between events A and B given
that event C has occurred is
i(A;B|C) = log p(A|B,C)/p(A|C) = log p(A,B|C)/[p(A|C)p(B|C)] = i(B;A|C).
Exercise 1 Prove the following chain rule for mutual information:
i(A;B,C) = i(A;B) + i(A;C|B).
Proof: We have
i(A;B,C) = log p(A|B,C)/p(A).
Now,
p(A,B,C) = p(A|B,C)p(C|B)p(B) = p(C|A,B)p(A|B)p(B),
so that
p(A|B,C) = p(C|A,B)p(A|B)/p(C|B).
Thus,
i(A;B,C) = log { [p(A|B)/p(A)] · [p(C|A,B)/p(C|B)] }
= log p(A|B)/p(A) + log p(C|A,B)/p(C|B)
= i(A;B) + i(C;A|B)
= i(A;B) + i(A;C|B).
Thus, the amount of information about event A provided by the joint
occurrence of events B and C is equal to the amount of information
provided about A by the occurrence of B plus the amount of informa-
tion provided about A by the occurrence of C given that B has already
occurred.
Example 1 Consider flipping a fair coin and providing information as to the outcome through the following so-called Z-channel. We have the following mapping from the coin outcome (H or T) to the real field: H → 0 and T → 1. Let X be the binary-valued random variable associated with the coin flipping.
[Figure 1 omitted: the Z-channel. Input X ∈ {0, 1} with p(X = 0) = p(X = 1) = 0.5; X = 0 is received as Y = 0 with probability 1, while X = 1 is received as Y = 0 with probability ε and as Y = 1 with probability 1 − ε.]
Figure 1: The Z-channel for Example 1.
1. What is the self-information of the events X = 0 and X = 1?
We have²
i(X=0) = −log p(X=0) = −log(1/2) = 1 bit,
i(X=1) = −log p(X=1) = −log(1/2) = 1 bit.
2. What is the self-information of events Y = 0, Y = 1?
We have:
i(Y=0) = −log p(Y=0).
p(Y=0) = p(Y=0|X=0)p(X=0) + p(Y=0|X=1)p(X=1) = 1·p(X=0) + ε·p(X=1) = (1+ε)/2.
i(Y=0) = −log [(1+ε)/2] = 1 − log(1+ε) bits.
Similarly,
i(Y=1) = −log p(Y=1).
p(Y=1) = 1 − p(Y=0) = (1−ε)/2.
i(Y=1) = −log [(1−ε)/2] = 1 − log(1−ε) bits.
3. What is the mutual information between events X = 0 and Y = 0?
We have
i(Y=0; X=0) = log p(Y=0|X=0)/p(Y=0) = log 2/(1+ε) = 1 − log(1+ε).
²Unless otherwise stated, log will denote the base-2 logarithm.
4. What is the mutual information between events X = 0 and Y = 1?
i(Y=1; X=0) = log p(Y=1|X=0)/p(Y=1) = log(0) = −∞.
5. What is the mutual information between events X = 1 and Y = 0?
i(Y=0; X=1) = log p(Y=0|X=1)/p(Y=0) = log 2ε/(1+ε).
6. What is the mutual information between events X = 1 and Y = 1?
i(Y=1; X=1) = log p(Y=1|X=1)/p(Y=1) = log [2(1−ε)/(1−ε)] = 1 bit.
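These event-level quantities are easy to verify numerically. The sketch below recomputes the three mutual-information values for the Z-channel; the value ε = 0.25 is an arbitrary choice of ours, used only for illustration:

```python
import math

eps = 0.25  # assumed crossover probability, chosen only for illustration
log2 = math.log2

# Channel: p(Y=0|X=0) = 1, p(Y=0|X=1) = eps, with p(X=0) = p(X=1) = 1/2.
pY0 = 0.5 * 1 + 0.5 * eps        # = (1 + eps)/2
pY1 = 1 - pY0                    # = (1 - eps)/2

i_X0_Y0 = log2(1 / pY0)          # = 1 - log2(1 + eps)
i_X1_Y0 = log2(eps / pY0)        # = log2(2*eps/(1 + eps)), negative for small eps
i_X1_Y1 = log2((1 - eps) / pY1)  # = 1 bit, for every eps < 1
print(i_X0_Y0, i_X1_Y0, i_X1_Y1)
```

Observe that observing Y = 0 when X = 1 actually yields negative mutual information: the output makes the true input look less likely than it was a priori.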
2.2 Average Self-Information: Entropy
Definition 4 (Entropy) The entropy of a source represented by a random variable X with realizations taking values from the set 𝒳 is
H(X) = −Σ_{x∈𝒳} p(x) log p(x).
From the above definition, we can write
H(X) = E[−log p(X)] = E[i(X)],
which implies that the entropy of a source is the average amount of
information produced by the source.
Clearly, since −log p(x) ≥ 0, we have H(X) ≥ 0.
Example 2 Consider a source X that produces two symbols with equal
probability (1/2). The entropy of this source is
H(X) = −(1/2) log(1/2) − (1/2) log(1/2) = 1 bit.
Example 3 Consider the tossing of a fair die (each outcome occurs
with equal probability 1/6). The average amount of information pro-
duced by this source is
H(X) = −(1/6) log(1/6) × 6 = log 6 ≈ 2.585 bits.
Note that the source in Example 3 produces more information on
average than the source in Example 2.
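The two entropy computations above can be reproduced in a few lines; the sketch below uses a helper function of our own naming:

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_x p(x) log p(x); zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # fair coin (Example 2): 1 bit
print(entropy([1/6] * 6))    # fair die (Example 3): log2(6), about 2.585 bits
```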
Definition 5 The joint entropy, H(X,Y ), of two discrete random vari-
ables X and Y is
H(X,Y) = −E_{X,Y}[log p(X,Y)] = −Σ_x Σ_y p(x,y) log p(x,y).
Definition 6 The conditional entropy of Y given X is
H(Y|X) = −E_{X,Y}[log p(Y|X)] = −Σ_x Σ_y p(x,y) log p(y|x).
Theorem 1 (Chain rule for entropy)
H(X, Y ) = H(X) +H(Y |X).
Proof: We have
−log p(X,Y) = −log [p(Y|X)p(X)] = −log p(X) − log p(Y|X).
Taking expectations on both sides above
E_{X,Y}[−log p(X,Y)] = E_{X,Y}[−log p(X)] + E_{X,Y}[−log p(Y|X)] ⟹ H(X,Y) = H(X) + H(Y|X).
The above result can be easily generalized to the entropy of a random vector X = (X1, X2, …, XN). Let X^n = (X1, X2, …, Xn). Then the entropy of X can be expressed as
H(X) = Σ_{n=1}^N H(Xn | X^{n−1}),
where we let H(X1|X^0) ≜ H(X1).
Proof: We can write
p(X) = Π_{n=1}^N p(Xn | X^{n−1}),
where we let p(X1|X^0) ≜ p(X1). Thus,
−log p(X) = −Σ_{n=1}^N log p(Xn | X^{n−1}).
Taking expectations with respect to the joint probability mass function
p(X) on both sides of the equation, we obtain the desired relation.
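The chain rule is easy to check numerically. In the sketch below, the joint pmf is an arbitrary example of ours, not one taken from the text:

```python
import math

# An arbitrary joint pmf p(x, y) over {0,1} x {0,1}, used purely for illustration.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

H_XY = -sum(pxy * math.log2(pxy) for pxy in p.values())
pX = {0: p[(0, 0)] + p[(0, 1)], 1: p[(1, 0)] + p[(1, 1)]}
H_X = -sum(px * math.log2(px) for px in pX.values())
# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x)
H_Y_given_X = -sum(pxy * math.log2(pxy / pX[x]) for (x, y), pxy in p.items())

print(H_XY, H_X + H_Y_given_X)   # the two values agree: H(X,Y) = H(X) + H(Y|X)
```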
2.3 Relative Entropy and Mutual Information
Definition 7 The relative entropy or Kullback-Leibler distance between
two probability mass functions p(x) and q(x) is given by
D(p‖q) = E_p[log p(x)/q(x)] = Σ_x p(x) log [p(x)/q(x)].
The relative entropy is a measure of the distance between the two distributions p(x) and q(x), even though it is not a true distance metric (it is not symmetric, for example).
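A direct computation illustrates both points at once, D(p‖q) ≥ 0 and, in general, D(p‖q) ≠ D(q‖p); the helper below is our own sketch:

```python
import math

def kl_divergence(p, q, base=2):
    """D(p||q) = sum_x p(x) log [p(x)/q(x)]; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))   # positive
print(kl_divergence(q, p))   # positive but different: D is not symmetric
print(kl_divergence(p, p))   # 0: a pmf is at zero "distance" from itself
```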
We defined earlier the mutual information between two events. Now consider two random variables X ∈ 𝒳 and Y ∈ 𝒴. If x ∈ 𝒳 is a realization of X and y ∈ 𝒴 is a realization of Y, the mutual information between x and y is
i(x; y) = log p(x|y)/p(x),
which is obviously a random variable over the ensemble of realizations
of X and Y . We have the following definition.
Definition 8 The mutual information between two discrete random
variables X and Y is
I(X;Y) = E[log p(X|Y)/p(X)]
= Σ_x Σ_y p(x,y) log [p(x|y)/p(x)]
= Σ_x Σ_y p(x,y) log [p(x,y)/(p(x)p(y))]
= D(p(x,y) ‖ p(x)p(y))
= Σ_x Σ_y p(x,y) log [p(y|x)/p(y)]
= I(Y;X).
The mutual information between random variables X and Y is the
average amount of information provided about X by observing Y, which
is also the average amount of uncertainty resolved about X by observ-
ing Y . As can be seen, I(X;Y ) = I(Y ;X), i.e. Y resolves as much
uncertainty about X as X about Y .
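Definition 8 translates directly into code. The sketch below (our own helper) computes I(X;Y) as D(p(x,y)‖p(x)p(y)) from a joint pmf stored as a dictionary:

```python
import math

def mutual_information(joint, base=2):
    """I(X;Y) = D(p(x,y) || p(x)p(y)) for a joint pmf {(x, y): prob}."""
    pX, pY = {}, {}
    for (x, y), pxy in joint.items():
        pX[x] = pX.get(x, 0.0) + pxy
        pY[y] = pY.get(y, 0.0) + pxy
    return sum(pxy * math.log(pxy / (pX[x] * pY[y]), base)
               for (x, y), pxy in joint.items() if pxy > 0)

# Independent X and Y: I(X;Y) = 0.  Fully dependent (Y = X): I(X;Y) = H(X) = 1 bit.
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}))
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))
```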
We have the following relations between entropy and mutual infor-
mation:
1.
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).
Proof: We have
I(X;Y) = E[log p(X|Y)/p(X)]
= E[−log p(X)] − E[−log p(X|Y)]
= H(X) − H(X|Y).
Since, as established earlier, I(X;Y) = I(Y;X), we have
I(X;Y ) = H(Y )H(Y |X).
2.
I(X;Y) = H(X) − H(X|Y) = H(X) − [H(X,Y) − H(Y)] = H(X) + H(Y) − H(X,Y).
3. I(X;X) = H(X) − H(X|X) = H(X).
The diagram in Figure 2 summarizes the relationship between the
various quantities:
Figure 2: Relationship between entropy, conditional entropy, joint entropy, and mutual information.
Definition 9 The conditional mutual information of discrete random
variables X and Y given Z is
I(X;Y|Z) = E_{XYZ}[log p(X|Y,Z)/p(X|Z)] = H(X|Z) − H(X|Y,Z).
Theorem 2 (Chain rule for mutual information):
Let X = (X1, X2, …, XN) be a random vector. Then the mutual information between X and Y is
I(X;Y) = Σ_{n=1}^N I(Xn; Y | X^{n−1}).
Proof:
I(X;Y) = H(X) − H(X|Y)
= Σ_{n=1}^N H(Xn|X^{n−1}) − Σ_{n=1}^N H(Xn|X^{n−1}, Y)
= Σ_{n=1}^N [H(Xn|X^{n−1}) − H(Xn|X^{n−1}, Y)]
= Σ_{n=1}^N I(Xn; Y | X^{n−1}).
2.4 Jensen's Inequality
Definition 10 A function f(x) is convex over an interval (a, b) if for all x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1
f(λx1 + (1−λ)x2) ≤ λf(x1) + (1−λ)f(x2).
The function is said to be strictly convex if equality above holds only if λ = 0 or λ = 1.
Definition 11 A function f(x) is said to be concave if −f(x) is convex.
Theorem 3 A function f that has a non-negative (positive) second
derivative is convex (strictly convex).
Theorem 4 (Jensen's inequality) Let f be a convex function and X a random variable. Then
E[f(X)] ≥ f(E[X]).
If f is strictly convex, equality holds if and only if X is a constant (not random).
Proof: (by induction)
Let L be the number of values the discrete random variable X can take.
For a binary random variable, i.e. L = 2, with probabilities p1 and p2, the inequality becomes
p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2),
and it is true in view of the definition of a convex function.
If f is strictly convex, equality holds iff p1 or p2 is zero, i.e. if X is deterministic.
Now let us assume the inequality holds for L = k − 1. We need to show that it then holds for L = k. Let p_i, i = 1, 2, …, k, be the probabilities for a k-valued random variable and define p′_i = p_i/(1 − p_k), i = 1, 2, …, k−1. Clearly Σ_i p′_i = 1 and the p′_i are thus the probabilities of some (k−1)-valued random variable. We have
Σ_{i=1}^k p_i f(x_i) = p_k f(x_k) + (1 − p_k) Σ_{i=1}^{k−1} p′_i f(x_i)
≥ p_k f(x_k) + (1 − p_k) f(Σ_{i=1}^{k−1} p′_i x_i)
≥ f(p_k x_k + (1 − p_k) Σ_{i=1}^{k−1} p′_i x_i)
= f(Σ_{i=1}^k p_i x_i),
where the first inequality uses the induction hypothesis and the second the definition of convexity.
For the equality proof, note that equality in the first inequality above holds when all the p′_i but one are zero, and in the last inequality when p_k or 1 − p_k is zero, i.e. when X is deterministic.
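For the convex function f(x) = x², Jensen's inequality reads E[X²] ≥ (E[X])², a relation we can spot-check on random data (the sample below is an arbitrary illustration of ours):

```python
import random

random.seed(0)  # arbitrary seed so the run is reproducible
xs = [random.uniform(-1.0, 2.0) for _ in range(10_000)]

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)
print(mean_sq >= mean * mean)   # True: E[f(X)] >= f(E[X]) for convex f(x) = x^2
```

Here the gap E[X²] − (E[X])² is exactly the variance of the sample, which is why the inequality can only be tight for a constant X.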
Theorem 5
D(p‖q) ≥ 0.
Equality above holds if and only if p(x) = q(x).
Proof:
D(p‖q) = E_p[log p(x)/q(x)]
= E_p[−log q(x)/p(x)]
≥ −log E_p[q(x)/p(x)]
= −log Σ_x q(x) = 0.
Since −log is strictly convex, equality above holds if and only if p(x)/q(x) = c, c a constant, i.e. p(x) = cq(x). Summing over x on both sides, c = 1, i.e. p(x) = q(x).
Corollary 1
I(X;Y) ≥ 0.
Proof:
I(X;Y) = D(p(x,y) ‖ p(x)p(y)) ≥ 0,
with equality iff p(x,y) = p(x)p(y), i.e. X and Y are independent.
Theorem 6 Let 𝒳 be the range of random variable X and |𝒳| the cardinality of 𝒳. Then
H(X) ≤ log |𝒳|,
with equality iff X is uniformly distributed over 𝒳.
Proof: Let u(x) = 1/|𝒳|, x ∈ 𝒳, and let p(x) be the probability mass function of X. We have
D(p‖u) = Σ_x p(x) log [p(x)/u(x)] = log |𝒳| − H(X) ≥ 0.
Theorem 7 Conditioning reduces entropy:
H(X) ≥ H(X|Y),
with equality iff X and Y are independent.
Proof:
I(X;Y) = H(X) − H(X|Y) ≥ 0 ⟹ H(X) ≥ H(X|Y),
with equality iff X and Y are independent.
Theorem 8
H(X1, X2, …, Xn) ≤ Σ_{i=1}^n H(Xi),
with equality iff the Xi are independent.
Proof: From the chain rule of entropy,
H(X1, X2, …, Xn) = Σ_{i=1}^n H(Xi | X1, X2, …, X_{i−1}) ≤ Σ_{i=1}^n H(Xi),
with equality iff the Xi are independent, by Theorem 7.
Theorem 9 (The log-sum inequality) For non-negative numbers a_i, i = 1, 2, …, n, and b_i, i = 1, 2, …, n,
Σ_{i=1}^n a_i log (a_i/b_i) ≥ (Σ_{i=1}^n a_i) log [(Σ_{i=1}^n a_i)/(Σ_{i=1}^n b_i)].
Proof: Let A = Σ_{i=1}^n a_i, B = Σ_{i=1}^n b_i, a′_i = a_i/A and b′_i = b_i/B. Then
Σ_{i=1}^n a_i log (a_i/b_i) = A Σ_{i=1}^n a′_i log [a′_i A/(b′_i B)]
= A Σ_{i=1}^n a′_i log (a′_i/b′_i) + A log (A/B)
= A·D(a′‖b′) + A log (A/B)
≥ A log (A/B).
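The log-sum inequality can be spot-checked on arbitrary non-negative numbers; the values below are our own:

```python
import math

a = [1.0, 2.0, 3.0]    # arbitrary non-negative numbers for illustration
b = [0.5, 1.0, 4.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs >= rhs)   # True, as the inequality guarantees
```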
Theorem 10 D(p‖q) is a convex function of the pair of probability mass functions p and q. In other words, if (p1, q1) and (p2, q2) are two pairs of probability mass functions, then
λD(p1‖q1) + (1−λ)D(p2‖q2) ≥ D(λp1 + (1−λ)p2 ‖ λq1 + (1−λ)q2).
Proof: By the log-sum inequality,
λp1(x) log [λp1(x)/(λq1(x))] + (1−λ)p2(x) log [(1−λ)p2(x)/((1−λ)q2(x))]
≥ [λp1(x) + (1−λ)p2(x)] log { [λp1(x) + (1−λ)p2(x)] / [λq1(x) + (1−λ)q2(x)] }.
Summing over all x yields the desired property.
Theorem 11 (Concavity of entropy): H(p) is a concave function of p.
Proof: We can write
H(p) = log |𝒳| − D(p‖u).
Since the relative entropy is convex (by the previous theorem), it follows
that the entropy is concave.
An interesting alternative proof is as follows: Let X1 be a random variable with distribution p1 taking values from a set A and X2 another random variable with distribution p2 taking values from the same set A. Moreover, let θ be a binary random variable with
θ = 1 with probability λ,
θ = 2 with probability 1 − λ.
Now let Z = X_θ. Then Z takes values from A with probability distribution
p(Z) = λp1 + (1−λ)p2.
Thus,
H(Z) = H(λp1 + (1−λ)p2).
On the other hand,
H(Z|θ) = λH(p1) + (1−λ)H(p2).
Since conditioning reduces entropy, we have
H(Z) ≥ H(Z|θ) ⟹ H(λp1 + (1−λ)p2) ≥ λH(p1) + (1−λ)H(p2),
which proves H(p) is concave in p.
Exercise 2 Consider two containers C1 and C2 (shown below) con-
taining n1 and n2 molecules, respectively. The energies of the n1 mole-
cules in C1 are i.i.d. random variables with common distribution p1.
Similarly, the energies of the molecules in C2 are i.i.d. with common
distribution p2.
1. Find the entropies of the ensembles in C1 and C2.
2. Now assume that the separation between the two containers is removed. Find the entropy of the mixture and show it is greater than the sum of the individual entropies in part 1.
Solution: Let Xi, i = 1, 2, …, n1, be the i.i.d. random variables associated with the energies in container C1 and Yi, i = 1, 2, …, n2, those for container C2. Then,
1.
H(X1, X2, …, X_{n1}) = Σ_{i=1}^{n1} H(Xi) = n1 H(p1).
Similarly,
H(Y1, Y2, …, Y_{n2}) = Σ_{i=1}^{n2} H(Yi) = n2 H(p2).
2. After the separation is removed, let Zi, i = 1, 2, …, n1 + n2, be the random energies associated with the molecules in the mixture. The n1 + n2 random variables are still i.i.d. with a common distribution given by
p(Zi) = [n1/(n1+n2)] p1 + [n2/(n1+n2)] p2.
Thus,
H(Z1, Z2, …, Z_{n1+n2}) = (n1+n2) H( [n1/(n1+n2)] p1 + [n2/(n1+n2)] p2 )
≥ (n1+n2) { [n1/(n1+n2)] H(p1) + [n2/(n1+n2)] H(p2) }
= n1 H(p1) + n2 H(p2),
where the inequality above is due to the concavity of the entropy.
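The mixing argument can be checked numerically. In the sketch below, the molecule counts and the two energy distributions are arbitrary choices of ours:

```python
import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

n1, n2 = 100, 300                     # assumed molecule counts
p1, p2 = [0.9, 0.1], [0.2, 0.8]       # assumed per-molecule energy pmfs
lam = n1 / (n1 + n2)
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

separate = n1 * entropy(p1) + n2 * entropy(p2)
mixed = (n1 + n2) * entropy(mix)
print(mixed >= separate)   # True: mixing cannot decrease the total entropy
```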
Theorem 12 Let (X, Y) have joint distribution p(x, y) = p(x)p(y|x). The mutual information I(X;Y) is a concave function of p(x) for fixed
p(y|x) and a convex function of p(y|x) for a fixed p(x).
2.5 The Data Processing Theorem
Definition 12 Three random variables X, Y, and Z form a Markov chain in that order, denoted X → Y → Z, if, conditioned on Y, Z is independent of X. In this case
p(x, y, z) = p(x)p(y|x)p(z|x, y) = p(x)p(y|x)p(z|y)
or
p(z|x, y) = p(z|y) ⟺ p(z, x|y) = p(z|y)p(x|y).
Theorem 13 (Data Processing theorem): Consider the Markov chain X → Y → Z. Then
I(X;Y) ≥ I(X;Z).
Proof: We have
I(X; Y, Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z).
However, I(X;Z|Y) = 0 since X and Z are independent given Y. Thus,
I(X;Y) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z).
Equality above holds when I(X;Y|Z) = 0, i.e. when X and Y are independent given Z, or, in other words, when we also have the Markov chain X → Z → Y.
Corollary 2 If Z = g(Y), then I(X;Y) ≥ I(X; g(Y)).
Proof: X → Y → g(Y) is a Markov chain.
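The data processing theorem can be illustrated by cascading two binary symmetric channels; the crossover probabilities below are assumptions of ours:

```python
import math

def mi(joint):
    """Mutual information (bits) of a joint pmf given as {(a, b): prob}."""
    pA, pB = {}, {}
    for (a, b), p in joint.items():
        pA[a] = pA.get(a, 0.0) + p
        pB[b] = pB.get(b, 0.0) + p
    return sum(p * math.log2(p / (pA[a] * pB[b]))
               for (a, b), p in joint.items() if p > 0)

# Markov chain X -> Y -> Z: uniform binary X through two cascaded BSCs.
e1, e2 = 0.1, 0.2   # assumed crossover probabilities
pXY = {(x, y): 0.5 * ((1 - e1) if x == y else e1) for x in (0, 1) for y in (0, 1)}
pXZ = {(x, z): sum(pXY[(x, y)] * ((1 - e2) if y == z else e2) for y in (0, 1))
       for x in (0, 1) for z in (0, 1)}

print(mi(pXY) >= mi(pXZ))   # True: processing Y into Z cannot gain information
```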
2.6 Fano's Inequality
Consider the simple communication system depicted in Figure 3. We
are interested in estimating X from the received data Y . Towards this
end, Y is processed by some function g to obtain an estimate X̂ of X. In digital communications we are interested in estimating X in as error-free a fashion as possible. Thus, we are interested in minimizing the probability of error,
Pe = Pr[X̂ ≠ X].
Fano's inequality relates the probability of error to the conditional entropy H(X|Y).
[Figure 3 omitted: X passes through the channel P(y|x) to produce Y, which is processed by g(·) to yield the estimate X̂.]
Figure 3: Figure for Fano's Inequality.
Theorem 14 (Fano's Inequality):
H(Pe) + Pe log(|𝒳| − 1) ≥ H(X|Y).
A somewhat weaker inequality is
1 + Pe log |𝒳| ≥ H(X|Y)
or
Pe ≥ [H(X|Y) − 1] / log |𝒳|.
Proof:
Let
E = 1 if X̂ ≠ X and E = 0 if X̂ = X.
Clearly, p(E = 1) = Pe and p(E = 0) = 1 − Pe. Then
H(E, X|Y) = H(X|Y) + H(E|X, Y) = H(E|Y) + H(X|E, Y).
Since Y determines X̂ with no uncertainty, and X together with X̂ determines E, we have H(E|X, Y) = 0. Thus,
H(X|Y) = H(E|Y) + H(X|E, Y) ≤ H(Pe) + H(X|E, Y). (1)
Now,
H(X|E, Y) = Pr(E = 0) H(X|E = 0, Y) + Pr(E = 1) H(X|E = 1, Y)
= Pr(E = 1) H(X|E = 1, Y) ≤ Pe log(|𝒳| − 1), (2)
since H(X|E = 0, Y) = 0.
Combining (1) and (2) we obtain the desired bound.
The weaker bound is obtained easily by bounding H(Pe) from above by 1 and replacing |𝒳| − 1 by |𝒳|.
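As a simple check of the bound, take uniform binary X sent over a BSC with crossover probability p and estimate X̂ = Y. Then Pe = p and H(X|Y) = h_b(p), so with |𝒳| = 2 Fano's inequality holds with equality. A sketch (the helper name is ours):

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Uniform binary X over a BSC(p), estimator Xhat = Y: Pe = p, H(X|Y) = hb(p).
# Fano: H(Pe) + Pe*log(|X| - 1) >= H(X|Y); with |X| = 2, log(|X| - 1) = 0.
p = 0.1   # assumed crossover probability
lhs = hb(p) + p * math.log2(2 - 1)
print(lhs, hb(p))   # equal: the bound is tight in this case
```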
2.7 Exercises for Chapter 2
Exercise 3 Consider the Z-channel discussed in Example 1. Com-
pute and plot the mutual information between the input and output of
the channel.
Solution: We have
I(X;Y) = Σ_x Σ_y p(x, y) log [p(x|y)/p(x)]
= Σ_x Σ_y p(y|x) p(x) log [p(y|x)/p(y)]
= 1 − [(1+ε)/2] log(1+ε) + (ε/2) log ε.
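The closed-form expression can be cross-checked against a direct evaluation of the double sum; ε = 0.25 below is an arbitrary choice of ours:

```python
import math

eps = 0.25  # assumed crossover probability, for illustration only
log2 = math.log2

# Z-channel joint pmf with uniform input: p(x, y) = p(y|x) * 1/2.
joint = {(0, 0): 0.5, (1, 0): 0.5 * eps, (1, 1): 0.5 * (1 - eps)}
pY = {0: 0.5 * (1 + eps), 1: 0.5 * (1 - eps)}

direct = sum(p * log2(p / (0.5 * pY[y])) for (x, y), p in joint.items() if p > 0)
closed = 1 - ((1 + eps) / 2) * log2(1 + eps) + (eps / 2) * log2(eps)
print(abs(direct - closed) < 1e-12)   # True: the two computations agree
```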
Exercise 4 The channel in Figure 5 below is known as the binary sym-
metric channel (BSC) and it is another simple model for a channel.
[Figure 4 omitted: plot of I(X;Y) versus ε.]
Figure 4: Mutual information between input and output for the Z-channel.
[Figure 5 omitted: the binary symmetric channel. Input X ∈ {0, 1} with p(X = 0) = p(X = 1) = 0.5; each input is received correctly with probability 1 − p and flipped with probability p.]
Figure 5: The binary symmetric channel.
Compute the mutual information between the input X and the output
Y as a function of the cross-over probability p.
We have
I(X;Y) = Σ_x Σ_y p(y|x) p(x) log [p(y|x)/p(y)].
Now,
p(Y = 0) = (1/2)(1 − p) + (1/2)p = 1/2,
p(Y = 1) = 1/2.
Thus,
I(X;Y) = (1 − p) log[2(1 − p)] + p log(2p) = 1 + p log p + (1 − p) log(1 − p) = 1 − h_b(p),
where
h_b(p) = −p log p − (1 − p) log(1 − p)
is the binary entropy function.
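A short sketch evaluating 1 − h_b(p) at a few crossover probabilities (the helper function is our own):

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5):
    print(p, 1 - hb(p))   # 1 bit at p = 0, 0 bits at p = 1/2
```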
A plot of the mutual information as a function of p is given below:
[Figure 6 omitted: plot of 1 − h_b(p) versus p.]
Figure 6: The mutual information between input and output for a BSC.
Exercise 5 Show that if X is a function of Y then H(Y) ≥ H(X).
Solution: By the corollary to the data processing theorem,
I(Y;Y) ≥ I(Y;X) ⟹ H(Y) ≥ H(X) − H(X|Y) = H(X),
since H(X|Y) = 0 when X is a function of Y.
Another solution:
H(Y) = H(X, Y) − H(X|Y) = H(X) + H(Y|X) ≥ H(X),
where H(X|Y) = 0 because X is a function of Y.
Exercise 6 Consider the simple rate 2/3 parity-check code where the
third bit, known as the parity bit, is the exclusive or of the first two bits:
000
011
101
110
Let the first two bits (the information bits) be defined by the random vector X = (X1, X2) and the parity bit by the random variable Y.
1. How much uncertainty is resolved about what X is by observing
Y ?
2. How much uncertainty about X is resolved by observing Y and
X2?
3. Now suppose the parity bit Y is observed through a BSC with a cross-over probability ε that produces an output Z. How much uncertainty is resolved about X by observing Z?
Solution:
1. We need to compute I(X;Y). We have
I(X;Y) = H(X) − H(X|Y).
Clearly,
H(X) = 2 bits.
Now,
H(X|Y) = H(X1|Y) + H(X2|X1, Y)
= Pr(Y = 0) H(X1|Y = 0) + Pr(Y = 1) H(X1|Y = 1)
= (1/2) H(X1|Y = 0) + (1/2) H(X1|Y = 1)
= H(X1|Y = 0) = 1 bit,
since H(X2|X1, Y) = 0 (X1 and the parity bit determine X2).
Thus,
I(X;Y) = 2 − 1 = 1 bit.
2.
I(X; Y, X2) = H(X) − H(X|Y, X2) = 2 bits,
since H(X|Y, X2) = 0 (Y and X2 determine X1).
3. We have
I(X; Z, Y) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y),
where I(X;Z|Y) = 0 since X → Y → Z is a Markov chain. Thus,
I(X;Z) = I(X;Y) − I(X;Y|Z).
Now,
I(X;Y|Z) = (1/2) I(X;Y|Z = 0) + (1/2) I(X;Y|Z = 1) = I(X;Y|Z = 0),
since p(Z = 0) = p(Z = 1) = 1/2 and, due to symmetry, it can be argued that I(X;Y|Z = 0) = I(X;Y|Z = 1) (verify this). Then,
I(X;Y|Z = 0) = Σ_x Σ_{y=0}^1 p(X = x, Y = y|Z = 0) log [p(X = x|Y = y, Z = 0)/p(X = x|Z = 0)]
= Σ_x Σ_{y=0}^1 p(Y = y|X = x, Z = 0) p(X = x|Z = 0) log [p(X = x|Y = y)/p(X = x|Z = 0)]
= Σ_x Σ_{y=0}^1 p(Y = y|X = x) p(X = x|Z = 0) log [p(Y = y|X = x) p(X = x) / (p(X = x|Z = 0) p(Y = y))]
= −Σ_x p(X = x|Z = 0) log [2 p(X = x|Z = 0)]
= h_b(ε).
Thus,
I(X;Z) = 1 − h_b(ε).