
Entropy Basics


2 Entropy and Mutual Information

    From the examples on information given earlier, it seems clear that

    the information, i(A), provided by the occurrence of an event A should

    have the following properties:

1. i(A) must be a monotonically decreasing function of its probability:

$i(A) = f[p(A)]$,

where $f(\cdot)$ is monotonically decreasing in $p(A)$.

2. $i(A) \geq 0$ for $0 \leq p(A) \leq 1$.

3. $i(A) = 0$ if $p(A) = 1$.

4. $i(A) > i(B)$ if $p(A) < p(B)$.

5. If A and B are independent events, then $i(A \cap B) = i(AB) = i(A) + i(B)$.

    It turns out there is one and only one function that satisfies the

    above requirements, namely the logarithmic function.

    Definition 1 Let A be an event having probability of occurrence p(A).

    Then the amount of information conveyed by the knowledge of the oc-

    currence of A, referred to as the self-information of event A, is given

    by

$$ i(A) = -\log_2[p(A)] \text{ bits}. $$

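Definition 1 translates directly into a one-line computation. The following minimal Python sketch (illustrative, not part of the original notes; the function name self_information is mine) evaluates i(A) for a few probabilities.

```python
import math

def self_information(p: float) -> float:
    """Self-information i(A) = -log2 p(A), in bits."""
    return -math.log2(p)

# A certain event carries no information; rarer events carry more.
for p in (1.0, 0.5, 0.25, 0.125, 0.01):
    print(f"p(A) = {p:>5}: i(A) = {self_information(p):.3f} bits")
```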

The unit of information is bits when the base of the logarithm is 2, nats when the base is e ≈ 2.718, and hartleys (also called dits) when a base-10 logarithm is used.

Definition 2 (Mutual information) The amount of information provided by the occurrence of an event B about the occurrence of an event A, known as the mutual information between A and B, is defined by^1

$$ i(A;B) = \log\frac{p(A|B)}{p(A)} = -\log p(A) - \left[-\log p(A|B)\right] = i(A) - i(A|B). $$

In other words, i(A;B) is also the amount of uncertainty about event A removed by the occurrence of event B.

    2.1 Properties of Mutual Information

1. $i(A;B) = i(B;A)$.

Proof:

$$ i(A;B) = \log\frac{p(A|B)}{p(A)} = \log\frac{p(A,B)}{p(A)p(B)} = \log\frac{p(B|A)}{p(B)} = i(B;A). $$

^1 Here, we abuse notation somewhat by using $i(\cdot)$ to denote self-information and $i(\cdot\,;\cdot)$ to denote mutual information.


2. Mutual information between events can be negative, positive, or zero:

$$ i(A;B) < 0 \text{ if } p(A|B) < p(A), \qquad i(A;B) > 0 \text{ if } p(A|B) > p(A), \qquad i(A;B) = 0 \text{ if } p(A|B) = p(A). $$

3. $i(A;A) = -\log p(A) = i(A)$.

Definition 3 The mutual information between events A and B given that event C has occurred is

$$ i(A;B|C) = \log\frac{p(A|B,C)}{p(A|C)} = \log\frac{p(A,B|C)}{p(A|C)p(B|C)} = i(B;A|C). $$

    Exercise 1 Prove the following chain rule for mutual information:

    i(A;B,C) = i(A;B) + i(A;C|B).

Proof: We have

$$ i(A;B,C) = \log\frac{p(A|B,C)}{p(A)}. $$

Now,

$$ p(A,B,C) = p(A|B,C)p(C|B)p(B) = p(C|A,B)p(A|B)p(B), $$

so that

$$ p(A|B,C) = \frac{p(C|A,B)p(A|B)}{p(C|B)}. $$


Thus,

$$ i(A;B,C) = \log\left\{\frac{p(A|B)}{p(A)}\cdot\frac{p(C|A,B)}{p(C|B)}\right\} = \log\frac{p(A|B)}{p(A)} + \log\frac{p(C|A,B)}{p(C|B)} = i(A;B) + i(C;A|B) = i(A;B) + i(A;C|B). $$

    Thus, the amount of information about event A provided by the joint

    occurrence of events B and C is equal to the amount of information

    provided about A by the occurrence of B plus the amount of informa-

    tion provided about A by the occurrence of C given that B has already

    occurred.

Example 1 Consider flipping a fair coin and providing information as to the outcome through the so-called Z-channel shown in Figure 1. We use the following mapping from the coin outcome (H or T) to the binary alphabet: H → 0 and T → 1. Let X be the binary-valued random variable associated with the coin flipping.

[Figure 1: The Z-channel of Example 1. The two inputs X = 0 and X = 1 are equiprobable (probability 0.5 each); X = 0 is received as Y = 0 with probability 1, while X = 1 is received as Y = 1 with probability 1 − ε and as Y = 0 with probability ε.]

    1. What is the self-information of the events X = 0, X = 1?


We have^2:

$$ i(X=0) = -\log p(X=0) = -\log\left(\tfrac{1}{2}\right) = 1 \text{ bit}, $$

$$ i(X=1) = -\log p(X=1) = -\log\left(\tfrac{1}{2}\right) = 1 \text{ bit}. $$

    2. What is the self-information of events Y = 0, Y = 1?

We have:

$$ i(Y=0) = -\log p(Y=0), $$

where

$$ p(Y=0) = p(Y=0|X=0)p(X=0) + p(Y=0|X=1)p(X=1) = 1\cdot\tfrac{1}{2} + \varepsilon\cdot\tfrac{1}{2} = \frac{1+\varepsilon}{2}, $$

so that

$$ i(Y=0) = -\log\frac{1+\varepsilon}{2} \text{ bits}. $$

Similarly,

$$ i(Y=1) = -\log p(Y=1), \qquad p(Y=1) = 1 - p(Y=0) = \frac{1-\varepsilon}{2}, $$

so that

$$ i(Y=1) = -\log\frac{1-\varepsilon}{2} \text{ bits}. $$

    3. What is the mutual information between events X = 0 and Y = 0?

We have

$$ i(Y=0;X=0) = \log\frac{p(Y=0|X=0)}{p(Y=0)} = \log\frac{2}{1+\varepsilon} = 1 - \log(1+\varepsilon) \text{ bits}. $$

^2 Unless otherwise stated, log will denote the base-2 logarithm.


4. What is the mutual information between events X = 0 and Y = 1?

$$ i(Y=1;X=0) = \log\frac{p(Y=1|X=0)}{p(Y=1)} = \log(0) = -\infty. $$

    5. What is the mutual information between events X = 1 and Y = 0?

$$ i(Y=0;X=1) = \log\frac{p(Y=0|X=1)}{p(Y=0)} = \log\frac{2\varepsilon}{1+\varepsilon} \text{ bits}. $$

6. What is the mutual information between events X = 1 and Y = 1?

$$ i(Y=1;X=1) = \log\frac{p(Y=1|X=1)}{p(Y=1)} = \log\frac{2(1-\varepsilon)}{1-\varepsilon} = 1 \text{ bit}. $$
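The six quantities computed in Example 1 are easy to check numerically. Below is a small illustrative Python sketch (not from the notes; the value eps = 0.1 and the helper names are assumptions chosen for the demonstration) that builds the Z-channel distributions and prints the corresponding self-informations and event mutual informations.

```python
import math

log2 = math.log2
eps = 0.1   # assumed value of the Z-channel crossover probability (illustrative)

# Input distribution and channel p(y|x): X=0 -> Y=0 surely; X=1 -> Y=0 w.p. eps.
pX = {0: 0.5, 1: 0.5}
pY_given_X = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): eps, (1, 1): 1 - eps}
pY = {y: sum(pY_given_X[(x, y)] * pX[x] for x in pX) for y in (0, 1)}

def i_event(p):
    """Self-information of an event, in bits."""
    return -log2(p)

def i_mutual(x, y):
    """Event mutual information i(Y=y; X=x) = log2 p(y|x)/p(y)."""
    p_cond = pY_given_X[(x, y)]
    return log2(p_cond / pY[y]) if p_cond > 0 else float("-inf")

print("i(X=0)      =", i_event(pX[0]))          # 1 bit
print("i(Y=0)      =", i_event(pY[0]))          # -log2((1+eps)/2)
print("i(Y=0;X=0)  =", i_mutual(0, 0))          # 1 - log2(1+eps)
print("i(Y=1;X=0)  =", i_mutual(0, 1))          # -inf
print("i(Y=0;X=1)  =", i_mutual(1, 0))          # log2(2*eps/(1+eps))
print("i(Y=1;X=1)  =", i_mutual(1, 1))          # 1 bit
```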

2.2 Average Self-Information (Entropy)

Definition 4 (Entropy) The entropy of a source represented by a random variable X with realizations taking values from the set $\mathcal{X}$ is

$$ H(X) = -\sum_{x\in\mathcal{X}} p(x)\log p(x). $$

    From the above definition, we can write

$$ H(X) = E[-\log p(X)] = E[i(X)], $$

    which implies that the entropy of a source is the average amount of

    information produced by the source.

Clearly, since $-\log p(x) \geq 0$, we have $H(X) \geq 0$.


Example 2 Consider a source X that produces two symbols with equal probability (1/2). The entropy of this source is

$$ H(X) = -\tfrac{1}{2}\log\left(\tfrac{1}{2}\right) - \tfrac{1}{2}\log\left(\tfrac{1}{2}\right) = 1 \text{ bit}. $$

Example 3 Consider the tossing of a fair die (each outcome occurs with equal probability 1/6). The average amount of information produced by this source is

$$ H(X) = -\tfrac{1}{6}\log\left(\tfrac{1}{6}\right)\times 6 = \log(6) \approx 2.585 \text{ bits}. $$

    Note that the source in Example 3 produces more information on

    average than the source in Example 2.
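Both examples can be reproduced with a few lines of code. The sketch below is illustrative only; the function name entropy is mine.

```python
import math

def entropy(pmf):
    """H(X) = -sum p log2 p, in bits, over outcomes with nonzero probability."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print("Fair coin          :", entropy([0.5, 0.5]), "bits")      # Example 2: 1 bit
print("Fair die           :", entropy([1/6] * 6), "bits")       # Example 3: log2(6)
print("Biased coin, p=0.9 :", entropy([0.9, 0.1]), "bits")      # less than 1 bit
```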

Definition 5 The joint entropy, H(X,Y), of two discrete random variables X and Y is

$$ H(X,Y) = -E_{X,Y}[\log p(X,Y)] = -\sum_x\sum_y p(x,y)\log p(x,y). $$

Definition 6 The conditional entropy of Y given X is

$$ H(Y|X) = -E_{X,Y}[\log p(Y|X)] = -\sum_x\sum_y p(x,y)\log p(y|x). $$

Theorem 1 (Chain rule for entropy)

$$ H(X,Y) = H(X) + H(Y|X). $$

Proof: We have

$$ -\log p(X,Y) = -\log[p(Y|X)p(X)] = -\log p(X) - \log p(Y|X). $$


Taking expectations on both sides,

$$ E_{X,Y}[-\log p(X,Y)] = E_{X,Y}[-\log p(X)] + E_{X,Y}[-\log p(Y|X)] \;\Longrightarrow\; H(X,Y) = H(X) + H(Y|X). $$

The above result is easily generalized to the entropy of a random vector $\mathbf{X} = (X_1, X_2, \ldots, X_N)$. Let $X^n = (X_1, X_2, \ldots, X_n)$. Then the entropy of $\mathbf{X}$ can be expressed as

$$ H(\mathbf{X}) = \sum_{n=1}^{N} H(X_n|X^{n-1}), $$

where we let $H(X_1|X^0) \triangleq H(X_1)$.

Proof: We can write

$$ p(\mathbf{X}) = \prod_{n=1}^{N} p(X_n|X^{n-1}), $$

where we let $p(X_1|X^0) \triangleq p(X_1)$. Thus,

$$ -\log p(\mathbf{X}) = -\sum_{n=1}^{N}\log p(X_n|X^{n-1}). $$

Taking expectations with respect to the joint probability mass function $p(\mathbf{X})$ on both sides of the equation, we obtain the desired relation.
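A quick numerical sanity check of the chain rule is sketched below (illustrative; the joint pmf values are arbitrary assumptions chosen for the demonstration).

```python
import math

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary joint pmf p(x, y) on {0,1} x {0,1}.
pXY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
pX = {x: sum(v for (a, _), v in pXY.items() if a == x) for x in (0, 1)}

# H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
H_Y_given_X = -sum(v * math.log2(v / pX[x]) for (x, _), v in pXY.items() if v > 0)

print("H(X,Y)        =", H(pXY.values()))
print("H(X) + H(Y|X) =", H(pX.values()) + H_Y_given_X)   # equal, by the chain rule
```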

    2.3 Relative Entropy and Mutual Information

Definition 7 The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is given by

$$ D(p\|q) = E_p\left[\log\frac{p(x)}{q(x)}\right] = \sum_x p(x)\log\frac{p(x)}{q(x)}. $$


The relative entropy is a measure of the distance between the two distributions p(x) and q(x), even though it is not a true distance metric.
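A direct implementation makes two features of the relative entropy easy to see: it is non-negative and it is not symmetric. The sketch below is illustrative (the distributions p and q are arbitrary assumptions).

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

print("D(p||q) =", kl_divergence(p, q))   # non-negative
print("D(q||p) =", kl_divergence(q, p))   # generally different: D is not symmetric
print("D(p||p) =", kl_divergence(p, p))   # zero when the distributions coincide
```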

We defined earlier the mutual information between two events. Now consider two random variables $X\in\mathcal{X}$ and $Y\in\mathcal{Y}$. If $x\in\mathcal{X}$ is a realization of X and $y\in\mathcal{Y}$ is a realization of Y, the mutual information between x and y is

$$ i(x;y) = \log\frac{p(x|y)}{p(x)}, $$

which is obviously a random variable over the ensemble of realizations of X and Y. We have the following definition.

Definition 8 The mutual information between two discrete random variables X and Y is

$$ I(X;Y) = E\left[\log\frac{p(X|Y)}{p(X)}\right] = \sum_x\sum_y p(x,y)\log\frac{p(x|y)}{p(x)} $$

$$ = \sum_x\sum_y p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = D(p(x,y)\,\|\,p(x)p(y)) $$

$$ = \sum_x\sum_y p(x,y)\log\frac{p(y|x)}{p(y)} = I(Y;X). $$

The mutual information between random variables X and Y is the average amount of information provided about X by observing Y, which is also the average amount of uncertainty resolved about X by observing Y. As can be seen, $I(X;Y) = I(Y;X)$, i.e. Y resolves as much uncertainty about X as X does about Y.

We have the following relations between entropy and mutual information:

1. $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$.

Proof: We have

$$ I(X;Y) = E\left[\log\frac{p(X|Y)}{p(X)}\right] = E[-\log p(X)] - E[-\log p(X|Y)] = H(X) - H(X|Y). $$

Since, as established earlier, $I(X;Y) = I(Y;X)$, we also have $I(X;Y) = H(Y) - H(Y|X)$.

2. $I(X;Y) = H(X) - H(X|Y) = H(X) - [H(X,Y) - H(Y)] = H(X) + H(Y) - H(X,Y)$.

3. $I(X;X) = H(X) - H(X|X) = H(X)$.
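The equivalent expressions for I(X;Y) can be verified numerically from any joint pmf. The following sketch (illustrative; the joint pmf is an arbitrary assumption) computes the mutual information three ways and shows the results agree.

```python
import math
from itertools import product

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

pXY = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # arbitrary joint pmf
xs, ys = (0, 1), (0, 1)
pX = {x: sum(pXY[(x, y)] for y in ys) for x in xs}
pY = {y: sum(pXY[(x, y)] for x in xs) for y in ys}

# I(X;Y) written as the divergence D(p(x,y) || p(x)p(y)).
I = sum(pXY[(x, y)] * math.log2(pXY[(x, y)] / (pX[x] * pY[y]))
        for x, y in product(xs, ys) if pXY[(x, y)] > 0)

H_X, H_Y, H_XY = H(pX.values()), H(pY.values()), H(pXY.values())
print("D(p(x,y)||p(x)p(y))  :", I)
print("H(X) - H(X|Y)        :", H_X - (H_XY - H_Y))   # H(X|Y) = H(X,Y) - H(Y)
print("H(X) + H(Y) - H(X,Y) :", H_X + H_Y - H_XY)
```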

The diagram in Figure 2 summarizes the relationship between the various quantities.

[Figure 2: Relationship between H(X), H(Y), H(X,Y), the conditional entropies, and the mutual information I(X;Y).]

Definition 9 The conditional mutual information of discrete random variables X and Y given Z is

$$ I(X;Y|Z) = E_{XYZ}\left[\log\frac{p(X|Y,Z)}{p(X|Z)}\right] = H(X|Z) - H(X|Y,Z). $$

Theorem 2 (Chain rule for mutual information) Let $\mathbf{X} = (X_1, X_2, \ldots, X_N)$ be a random vector. Then the mutual information between $\mathbf{X}$ and Y is

$$ I(\mathbf{X};Y) = \sum_{n=1}^{N} I(X_n;Y|X^{n-1}). $$

Proof:

$$ I(\mathbf{X};Y) = H(\mathbf{X}) - H(\mathbf{X}|Y) = \sum_{n=1}^{N} H(X_n|X^{n-1}) - \sum_{n=1}^{N} H(X_n|X^{n-1},Y) $$

$$ = \sum_{n=1}^{N}\left[H(X_n|X^{n-1}) - H(X_n|X^{n-1},Y)\right] = \sum_{n=1}^{N} I(X_n;Y|X^{n-1}). $$

2.4 Jensen's Inequality

Definition 10 A function f(x) is convex over an interval (a,b) if for all $x_1, x_2 \in (a,b)$ and $0 \leq \lambda \leq 1$

$$ f(\lambda x_1 + (1-\lambda)x_2) \leq \lambda f(x_1) + (1-\lambda)f(x_2). $$

The function is said to be strictly convex if equality above holds only if $\lambda = 0$ or $\lambda = 1$.

Definition 11 A function f(x) is said to be concave if $-f(x)$ is convex.

    Theorem 3 A function f that has a non-negative (positive) second

    derivative is convex (strictly convex).

Theorem 4 (Jensen's inequality) Let f be a convex function and X a random variable. Then

$$ E[f(X)] \geq f(E[X]). $$

If f is strictly convex, equality holds if and only if X is a constant (not random).

Proof (by induction): Let L be the number of values the discrete random variable X can take. For a binary random variable, i.e. L = 2, with probabilities $p_1$ and $p_2$, the inequality becomes

$$ p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2), $$

which is true in view of the definition of a convex function. If f is strictly convex, equality holds iff $p_1$ or $p_2$ is zero, i.e. if X is deterministic.

Now let us assume the inequality holds for $L = k-1$. We need to show that it then holds for $L = k$. Let $p_i$, $i = 1, 2, \ldots, k$, be the probabilities of a k-valued random variable and define $p_i' = p_i/(1-p_k)$, $i = 1, 2, \ldots, k-1$. Clearly $\sum_i p_i' = 1$, and the $p_i'$ are thus the probabilities of some (k-1)-valued random variable. We have

$$ \sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1-p_k)\sum_{i=1}^{k-1} p_i' f(x_i) $$

$$ \geq p_k f(x_k) + (1-p_k)\, f\!\left(\sum_{i=1}^{k-1} p_i' x_i\right) $$

$$ \geq f\!\left(p_k x_k + (1-p_k)\sum_{i=1}^{k-1} p_i' x_i\right) = f\!\left(\sum_{i=1}^{k} p_i x_i\right), $$

where the first inequality is the induction hypothesis applied to the (k-1)-valued random variable and the second follows from the definition of convexity. For the equality condition, note that equality in the first inequality requires all but one of the $p_i'$ to be zero, and equality in the second requires $p_k$ to be zero or one; in both cases X is deterministic.
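A small numerical experiment illustrates the inequality; the convex function f(x) = x^2 and the random pmf below are assumptions chosen only for the demonstration.

```python
import random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(5)]       # support points of X
w = [random.random() for _ in range(5)]
p = [wi / sum(w) for wi in w]                        # a random pmf

f = lambda x: x * x                                  # a convex function
E_fX = sum(pi * f(xi) for pi, xi in zip(p, xs))
f_EX = f(sum(pi * xi for pi, xi in zip(p, xs)))
print("E[f(X)] =", E_fX, ">=", "f(E[X]) =", f_EX)    # Jensen's inequality
```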

Theorem 5

$$ D(p\|q) \geq 0. $$

Equality holds if and only if p(x) = q(x).

Proof:

$$ D(p\|q) = E_p\left[\log\frac{p(x)}{q(x)}\right] = E_p\left[-\log\frac{q(x)}{p(x)}\right] \geq -\log E_p\left[\frac{q(x)}{p(x)}\right] = -\log\sum_x q(x) = 0, $$

where the inequality is Jensen's inequality applied to the convex function $-\log(\cdot)$. Since $-\log(\cdot)$ is strictly convex, equality holds if and only if $p(x)/q(x) = c$, c a constant, i.e. $p(x) = c\,q(x)$. Summing over x on both sides, c = 1, i.e. p(x) = q(x).

Corollary 1

$$ I(X;Y) \geq 0. $$

Proof:

$$ I(X;Y) = D(p(x,y)\,\|\,p(x)p(y)) \geq 0, $$

with equality iff p(x,y) = p(x)p(y), i.e. X and Y are independent.


Theorem 6 Let $\mathcal{X}$ be the range of random variable X and $|\mathcal{X}|$ the cardinality of $\mathcal{X}$. Then

$$ H(X) \leq \log|\mathcal{X}|, $$

with equality iff X is uniformly distributed over $\mathcal{X}$.

Proof: Let $u(x) = 1/|\mathcal{X}|$, $x\in\mathcal{X}$, and let p(x) be the probability mass function of X. We have

$$ D(p\|u) = \sum_x p(x)\log\frac{p(x)}{u(x)} = \log|\mathcal{X}| - H(X) \geq 0. $$

Theorem 7 Conditioning reduces entropy:

$$ H(X) \geq H(X|Y), $$

with equality iff X and Y are independent.

Proof:

$$ I(X;Y) = H(X) - H(X|Y) \geq 0 \;\Longrightarrow\; H(X) \geq H(X|Y), $$

with equality iff X and Y are independent.

Theorem 8

$$ H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i), $$

with equality iff the $X_i$ are independent.

Proof: From the chain rule of entropy,

$$ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i|X_1, X_2, \ldots, X_{i-1}) \leq \sum_{i=1}^{n} H(X_i), $$

with equality iff the $X_i$ are independent, by Theorem 7.


Theorem 9 (The log-sum inequality) For non-negative numbers $a_i$, $i = 1, 2, \ldots, n$, and $b_i$, $i = 1, 2, \ldots, n$,

$$ \sum_{i=1}^{n} a_i\log\frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right)\log\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}. $$

Proof: Let $A = \sum_{i=1}^{n} a_i$, $B = \sum_{i=1}^{n} b_i$, $a_i' = a_i/A$ and $b_i' = b_i/B$. Then

$$ \sum_{i=1}^{n} a_i\log\frac{a_i}{b_i} = A\sum_{i=1}^{n} a_i'\log\frac{a_i' A}{b_i' B} = A\sum_{i=1}^{n} a_i'\log\frac{a_i'}{b_i'} + A\log\frac{A}{B} = A\,D(a'\|b') + A\log\frac{A}{B} \geq A\log\frac{A}{B}. $$
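The inequality is easy to spot-check numerically; the vectors a and b below are arbitrary non-negative numbers chosen only for illustration.

```python
import math

a = [0.5, 1.2, 3.0]
b = [1.0, 0.4, 2.5]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print("sum a_i log(a_i/b_i)             =", lhs)
print("(sum a_i) log(sum a_i / sum b_i) =", rhs)   # lhs >= rhs
```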

Theorem 10 $D(p\|q)$ is a convex function of the pair of probability mass functions p and q. In other words, if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability mass functions, then for $0 \leq \lambda \leq 1$

$$ \lambda D(p_1\|q_1) + (1-\lambda)D(p_2\|q_2) \geq D(\lambda p_1 + (1-\lambda)p_2\,\|\,\lambda q_1 + (1-\lambda)q_2). $$

Proof: By the log-sum inequality,

$$ \lambda p_1(x)\log\frac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda)p_2(x)\log\frac{(1-\lambda)p_2(x)}{(1-\lambda)q_2(x)} \geq \left(\lambda p_1(x) + (1-\lambda)p_2(x)\right)\log\frac{\lambda p_1(x) + (1-\lambda)p_2(x)}{\lambda q_1(x) + (1-\lambda)q_2(x)}. $$

Summing over all x yields the desired property.

    Theorem 11 (Concavity of entropy): H(p) is a concave function of p.


Proof: We can write

$$ H(p) = \log|\mathcal{X}| - D(p\|u), $$

where u is the uniform distribution on $\mathcal{X}$. Since the relative entropy is convex (by the previous theorem), it follows that the entropy is concave.

An interesting alternative proof is as follows. Let $X_1$ be a random variable with distribution $p_1$ taking values from a set $\mathcal{A}$ and $X_2$ another random variable with distribution $p_2$ taking values from the same set $\mathcal{A}$. Moreover, let $\theta$ be a binary random variable with

$$ \theta = \begin{cases} 1, & \text{with probability } \lambda\\ 2, & \text{with probability } 1-\lambda. \end{cases} $$

Now let $Z = X_\theta$. Then Z takes values from $\mathcal{A}$ with probability distribution

$$ p_Z = \lambda p_1 + (1-\lambda)p_2. $$

Thus,

$$ H(Z) = H(\lambda p_1 + (1-\lambda)p_2). $$

On the other hand,

$$ H(Z|\theta) = \lambda H(p_1) + (1-\lambda)H(p_2). $$

Since conditioning reduces entropy, we have

$$ H(Z) \geq H(Z|\theta) \;\Longrightarrow\; H(\lambda p_1 + (1-\lambda)p_2) \geq \lambda H(p_1) + (1-\lambda)H(p_2), $$

which proves that H(p) is concave in p.


Exercise 2 Consider two containers C1 and C2 containing n1 and n2 molecules, respectively. The energies of the n1 molecules in C1 are i.i.d. random variables with common distribution p1. Similarly, the energies of the molecules in C2 are i.i.d. with common distribution p2.

1. Find the entropies of the ensembles in C1 and C2.

2. Now assume that the separation between the two containers is removed. Find the entropy of the mixture and show it is no smaller than the sum of the individual entropies found in part 1.

Solution: Let $X_i$, $i = 1, 2, \ldots, n_1$, be the i.i.d. random variables associated with the energies in container C1 and $Y_i$, $i = 1, 2, \ldots, n_2$, those for container C2. Then:

1. $$ H(X_1, X_2, \ldots, X_{n_1}) = \sum_{i=1}^{n_1} H(X_i) = n_1 H(p_1). $$

Similarly,

$$ H(Y_1, Y_2, \ldots, Y_{n_2}) = \sum_{i=1}^{n_2} H(Y_i) = n_2 H(p_2). $$

2. After the separation is removed, let $Z_i$, $i = 1, 2, \ldots, n_1+n_2$, be the random energies associated with the molecules in the mixture. The $n_1+n_2$ random variables are still i.i.d., with common distribution

$$ p_Z = \frac{n_1}{n_1+n_2}\,p_1 + \frac{n_2}{n_1+n_2}\,p_2. $$


Thus,

$$ H(Z_1, Z_2, \ldots, Z_{n_1+n_2}) = (n_1+n_2)\,H\!\left(\frac{n_1}{n_1+n_2}p_1 + \frac{n_2}{n_1+n_2}p_2\right) $$

$$ \geq (n_1+n_2)\left\{\frac{n_1}{n_1+n_2}H(p_1) + \frac{n_2}{n_1+n_2}H(p_2)\right\} = n_1 H(p_1) + n_2 H(p_2), $$

where the inequality above is due to the concavity of the entropy.
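The mixing argument can be checked numerically; the distributions and molecule counts below are assumed values used only for illustration.

```python
import math

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

p1, p2 = [0.9, 0.1], [0.3, 0.7]    # assumed energy distributions
n1, n2 = 100, 300                  # assumed molecule counts

lam = n1 / (n1 + n2)
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

print("Before mixing:", n1 * H(p1) + n2 * H(p2), "bits")
print("After mixing :", (n1 + n2) * H(mix), "bits   (never smaller, by concavity of H)")
```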

Theorem 12 Let (X,Y) have joint distribution p(x,y) = p(x)p(y|x). The mutual information I(X;Y) is a concave function of p(x) for fixed p(y|x) and a convex function of p(y|x) for a fixed p(x).

    2.5 The Data Processing Theorem

Definition 12 Three random variables X, Y, and Z form a Markov chain in that order, denoted $X \to Y \to Z$, if, conditioned on Y, Z is independent of X. In this case

$$ p(x,y,z) = p(x)p(y|x)p(z|x,y) = p(x)p(y|x)p(z|y), $$

or

$$ p(z|x,y) = p(z|y) \;\Longleftrightarrow\; p(z,x|y) = p(z|y)p(x|y). $$

Theorem 13 (Data Processing theorem) Consider the Markov chain $X \to Y \to Z$. Then

$$ I(X;Y) \geq I(X;Z). $$


Proof: We have

$$ I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z). $$

However, $I(X;Z|Y) = 0$ since X and Z are independent given Y. Thus,

$$ I(X;Y) = I(X;Z) + I(X;Y|Z) \geq I(X;Z). $$

Equality holds when $I(X;Y|Z) = 0$, i.e. when X and Y are independent given Z, or, in other words, when we have a Markov chain $X \to Z \to Y$.

Corollary 2 If Z = g(Y), then $I(X;Y) \geq I(X;g(Y))$.

Proof: $X \to Y \to g(Y)$ is a Markov chain.
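The data processing inequality can be illustrated with two binary symmetric channels in cascade; the sketch below is illustrative, and the crossover probabilities 0.1 and 0.2 are assumptions chosen for the example.

```python
import math

def mutual_info(pXY):
    """I(X;Y) in bits, from a joint pmf given as {(x, y): prob}."""
    xs = {x for x, _ in pXY}
    ys = {y for _, y in pXY}
    pX = {x: sum(pXY.get((x, y), 0) for y in ys) for x in xs}
    pY = {y: sum(pXY.get((x, y), 0) for x in xs) for y in ys}
    return sum(p * math.log2(p / (pX[x] * pY[y]))
               for (x, y), p in pXY.items() if p > 0)

def bsc(bit, q):
    """Output pmf of a BSC with crossover probability q, given the input bit."""
    return {bit: 1 - q, 1 - bit: q}

# Markov chain X -> Y -> Z: X ~ Bernoulli(1/2), Y = BSC(X, 0.1), Z = BSC(Y, 0.2).
pX = {0: 0.5, 1: 0.5}
pXY = {(x, y): pX[x] * py for x in pX for y, py in bsc(x, 0.1).items()}
pXZ = {}
for (x, y), pxy in pXY.items():
    for z, pz in bsc(y, 0.2).items():
        pXZ[(x, z)] = pXZ.get((x, z), 0) + pxy * pz

print("I(X;Y) =", mutual_info(pXY))
print("I(X;Z) =", mutual_info(pXZ))   # smaller: processing Y further cannot add information
```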

2.6 Fano's Inequality

Consider the simple communication system depicted in Figure 3. We are interested in estimating X from the received data Y. Towards this end, Y is processed by some function g to obtain an estimate $\hat{X}$ of X. In digital communications we are interested in estimating X in as error-free a fashion as possible. Thus, we are interested in minimizing the probability of error,

$$ P_e = \Pr[\hat{X} \neq X]. $$

Fano's inequality relates the probability of error to the conditional entropy H(X|Y).


[Figure 3: Setup for Fano's inequality: X is sent through a channel P(y|x) to produce Y, which is processed by $g(\cdot)$ to form the estimate $\hat{X}$.]

Theorem 14 (Fano's Inequality)

$$ H(P_e) + P_e\log(|\mathcal{X}|-1) \geq H(X|Y). $$

A somewhat weaker inequality is

$$ 1 + P_e\log|\mathcal{X}| \geq H(X|Y), $$

or

$$ P_e \geq \frac{H(X|Y) - 1}{\log|\mathcal{X}|}. $$

Proof: Let

$$ E = \begin{cases} 1 & \text{if } \hat{X} \neq X \\ 0 & \text{if } \hat{X} = X. \end{cases} $$

Clearly, $p(E=1) = P_e$ and $p(E=0) = 1 - P_e$. Then

$$ H(E,X|Y) = H(X|Y) + H(E|X,Y) = H(E|Y) + H(X|E,Y). $$

Since Y determines $\hat{X}$ with no uncertainty, and X and $\hat{X}$ determine E, $H(E|X,Y) = 0$. Thus,

$$ H(X|Y) = H(E|Y) + H(X|E,Y) \leq H(P_e) + H(X|E,Y). \qquad (1) $$


Now,

$$ H(X|E,Y) = \underbrace{\Pr(E=0)H(X|E=0,Y)}_{0} + \Pr(E=1)H(X|E=1,Y) = \Pr(E=1)H(X|E=1,Y) \leq P_e\log(|\mathcal{X}|-1). \qquad (2) $$

Combining (1) and (2) we obtain the desired bound.

The weaker bound is obtained easily by bounding $H(P_e)$ from above by 1 and replacing $|\mathcal{X}|-1$ by $|\mathcal{X}|$.
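As a numerical illustration, the sketch below evaluates both sides of Fano's inequality for a small example: X uniform on three symbols, a symmetric channel that keeps the symbol with probability 1 − q and otherwise outputs one of the two other symbols uniformly, and the estimator g(y) = y. All of these choices are assumptions made for the demonstration.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

q = 0.2
symbols = (0, 1, 2)
pXY = {(x, y): (1 / 3) * ((1 - q) if x == y else q / 2)
       for x in symbols for y in symbols}

Pe = sum(p for (x, y), p in pXY.items() if x != y)            # error prob. of g(y) = y
pY = {y: sum(pXY[(x, y)] for x in symbols) for y in symbols}
H_X_given_Y = -sum(p * math.log2(p / pY[y]) for (x, y), p in pXY.items() if p > 0)

print("H(X|Y)     =", H_X_given_Y)
print("Fano bound =", hb(Pe) + Pe * math.log2(len(symbols) - 1))   # >= H(X|Y)
```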

    2.7 Exercises for Chapter 2

Exercise 3 Consider the Z-channel discussed in Example 1. Compute and plot the mutual information between the input and output of the channel.

Solution: We have

$$ I(X;Y) = \sum_x\sum_y p(x,y)\log\frac{p(x|y)}{p(x)} = \sum_x\sum_y p(y|x)p(x)\log\frac{p(y|x)}{p(y)} = 1 - \frac{1+\varepsilon}{2}\log(1+\varepsilon) + \frac{\varepsilon}{2}\log\varepsilon. $$
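The closed-form expression can be checked against a direct evaluation of I(X;Y) from the joint pmf. The sketch below (illustrative; the function names are mine, and zero-probability pairs are skipped to avoid log 0) does this for a few values of ε.

```python
import math

def zchannel_mi(eps):
    """I(X;Y) for the Z-channel of Example 1, directly from the joint pmf."""
    pXY = {(0, 0): 0.5, (1, 0): 0.5 * eps, (1, 1): 0.5 * (1 - eps)}
    pY = {0: 0.5 * (1 + eps), 1: 0.5 * (1 - eps)}
    return sum(p * math.log2(p / (0.5 * pY[y])) for (x, y), p in pXY.items() if p > 0)

def closed_form(eps):
    """1 - (1+eps)/2 * log2(1+eps) + (eps/2) * log2(eps), as derived above."""
    term = (eps / 2) * math.log2(eps) if eps > 0 else 0.0
    return 1 - (1 + eps) / 2 * math.log2(1 + eps) + term

for eps in (0.0, 0.1, 0.3, 0.5, 0.9):
    print(f"eps = {eps}: direct = {zchannel_mi(eps):.4f}, closed form = {closed_form(eps):.4f}")
```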

Exercise 4 The channel in Figure 5 below is known as the binary symmetric channel (BSC) and it is another simple model for a channel.

[Figure 4: Mutual information I(X;Y) between the input and output of the Z-channel of Exercise 3, plotted as a function of ε over the range 0 to 1.]

[Figure 5: The binary symmetric channel. The inputs X = 0 and X = 1 are equiprobable (probability 0.5 each); each input is received correctly with probability 1 − p and flipped with probability p.]

    Compute the mutual information between the input X and the output

    Y as a function of the cross-over probability p.

We have

$$ I(X;Y) = \sum_x\sum_y p(y|x)p(x)\log\frac{p(y|x)}{p(y)}. $$

Now,

$$ p(Y=0) = \tfrac{1}{2}(1-p) + \tfrac{1}{2}p = \tfrac{1}{2}, \qquad p(Y=1) = \tfrac{1}{2}. $$

Thus,

$$ I(X;Y) = (1-p)\log[2(1-p)] + p\log(2p) = 1 + p\log p + (1-p)\log(1-p) = 1 - h_b(p), $$

where

$$ h_b(p) = -p\log p - (1-p)\log(1-p) $$

is the binary entropy function.

A plot of the mutual information as a function of p is given below.

[Figure 6: The mutual information between input and output for a BSC, as a function of the crossover probability p.]
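The curve in Figure 6 is just $1 - h_b(p)$; the short illustrative sketch below prints a few points of it.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:>4}: I(X;Y) = {1 - hb(p):.4f} bits")
```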

Exercise 5 Show that if X is a function of Y then $H(Y) \geq H(X)$.

Solution: By the corollary to the data processing theorem,

$$ I(Y;Y) \geq I(Y;X) \;\Longrightarrow\; H(Y) \geq H(X) - H(X|Y) = H(X), $$

where the last equality holds because $H(X|Y) = 0$ when X is a function of Y.

Another solution:

$$ H(Y) = H(X,Y) - \underbrace{H(X|Y)}_{0} = H(X) + H(Y|X) \geq H(X). $$

Exercise 6 Consider the simple rate-2/3 parity-check code where the third bit, known as the parity bit, is the exclusive-or of the first two bits:

000
011
101
110

Let the first two (information) bits be defined by the random vector $\mathbf{X} = (X_1, X_2)$ and the parity bit by the random variable Y.

1. How much uncertainty is resolved about what $\mathbf{X}$ is by observing Y?

2. How much uncertainty about $\mathbf{X}$ is resolved by observing Y and $X_2$?

3. Now suppose the parity bit Y is observed through a BSC with crossover probability ε that produces an output Z. How much uncertainty is resolved about $\mathbf{X}$ by observing Z?

    Solution:


1. We need to compute $I(\mathbf{X};Y)$. We have

$$ I(\mathbf{X};Y) = H(\mathbf{X}) - H(\mathbf{X}|Y). $$

Clearly, $H(\mathbf{X}) = 2$ bits. Now,

$$ H(\mathbf{X}|Y) = H(X_1|Y) + \underbrace{H(X_2|X_1,Y)}_{0} = \Pr(Y=0)H(X_1|Y=0) + \Pr(Y=1)H(X_1|Y=1) = \tfrac{1}{2}H(X_1|Y=0) + \tfrac{1}{2}H(X_1|Y=1) = H(X_1|Y=0) = 1, $$

where $H(X_2|X_1,Y) = 0$ because $X_2 = X_1 \oplus Y$. Thus,

$$ I(\mathbf{X};Y) = 2 - 1 = 1 \text{ bit}. $$

2. Since $X_1 = X_2 \oplus Y$, observing Y and $X_2$ determines $\mathbf{X}$ completely, so

$$ I(\mathbf{X};Y,X_2) = H(\mathbf{X}) - \underbrace{H(\mathbf{X}|Y,X_2)}_{0} = 2 \text{ bits}. $$

3. We have

$$ I(\mathbf{X};Z,Y) = I(\mathbf{X};Z) + I(\mathbf{X};Y|Z) = I(\mathbf{X};Y) + \underbrace{I(\mathbf{X};Z|Y)}_{0} $$

$$ \Longrightarrow\; I(\mathbf{X};Z) = I(\mathbf{X};Y) - I(\mathbf{X};Y|Z). $$

Now,

$$ I(\mathbf{X};Y|Z) = \tfrac{1}{2}I(\mathbf{X};Y|Z=0) + \tfrac{1}{2}I(\mathbf{X};Y|Z=1) = I(\mathbf{X};Y|Z=0), $$

since $p(Z=0) = p(Z=1) = 1/2$ and, due to symmetry, it can be argued that $I(\mathbf{X};Y|Z=0) = I(\mathbf{X};Y|Z=1)$ (verify this). Then,

$$ I(\mathbf{X};Y|Z=0) = \sum_{x}\sum_{y=0}^{1} p(\mathbf{X}=x, Y=y|Z=0)\log\frac{p(\mathbf{X}=x|Y=y,Z=0)}{p(\mathbf{X}=x|Z=0)} $$

$$ = \sum_{x}\sum_{y=0}^{1} p(Y=y|\mathbf{X}=x,Z=0)\,p(\mathbf{X}=x|Z=0)\log\frac{p(\mathbf{X}=x|Y=y)}{p(\mathbf{X}=x|Z=0)} $$

$$ = \sum_{x}\sum_{y=0}^{1} p(Y=y|\mathbf{X}=x)\,p(\mathbf{X}=x|Z=0)\log\frac{p(Y=y|\mathbf{X}=x)\,p(\mathbf{X}=x)}{p(\mathbf{X}=x|Z=0)\,p(Y=y)} $$

$$ = \sum_{x} p(\mathbf{X}=x|Z=0)\log\frac{1}{2\,p(\mathbf{X}=x|Z=0)} = h_b(\varepsilon), $$

where in the last step we used the fact that, for each x, $p(Y=y|\mathbf{X}=x)$ equals 1 for $y = x_1\oplus x_2$ and 0 otherwise, together with $p(\mathbf{X}=x) = 1/4$ and $p(Y=y) = 1/2$. Thus,

$$ I(\mathbf{X};Z) = 1 - h_b(\varepsilon). $$
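The result $I(\mathbf{X};Z) = 1 - h_b(\varepsilon)$ can be verified by brute force over the joint distribution of $(\mathbf{X}, Z)$; the value of eps below is an assumption chosen only for the check.

```python
import math
from itertools import product

def hb(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps = 0.15   # assumed BSC crossover probability

# Joint pmf of (X, Z): X = (X1, X2) uniform, Y = X1 xor X2, Z = Y flipped w.p. eps.
pXZ = {}
for x1, x2, flip in product((0, 1), repeat=3):
    z = (x1 ^ x2) ^ flip
    pXZ[((x1, x2), z)] = pXZ.get(((x1, x2), z), 0) + 0.25 * (eps if flip else 1 - eps)

pZ = {z: sum(p for (x, zz), p in pXZ.items() if zz == z) for z in (0, 1)}
I_XZ = sum(p * math.log2(p / (0.25 * pZ[z])) for (x, z), p in pXZ.items() if p > 0)

print("I(X;Z) from the joint pmf:", I_XZ)
print("1 - hb(eps)              :", 1 - hb(eps))
```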
