Information Theory: Entropy, Relative Entropy and Mutual Information



Entropy, Relative Entropy and Mutual Information

Prof. Ja-Ling Wu

Department of Computer Science and Information Engineering, National Taiwan University


Definition: The entropy H(X) of a discrete random variable X is defined by

    H(X) = - \sum_{x \in X} P(x) \log P(x)

log base 2: H(P) is measured in bits.

Convention: 0 \log 0 = 0 (since x \log x → 0 as x → 0); adding terms of zero probability does not change the entropy.
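Not part of the original slides: a minimal Python sketch of this definition. The function name and the example distribution are illustrative only.

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0 by convention."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```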


Note that entropy is a function of the distribution of X. It does not depend on the actual values taken by the r.v. X, but only on the probabilities.

Remark: If X ~ P(x), then the expected value of the r.v. g(X) is written as

    E_p g(X) = \sum_{x \in X} g(x) P(x)        (expectation value)

The entropy of X is the expected value of \log \frac{1}{P(X)}:

    H(X) = E_p \log \frac{1}{P(X)}

where \log \frac{1}{P(x)} is the self-information of the outcome x.


Lemma 1.1: H(X) ≥ 0

Lemma 1.2: H_b(X) = (\log_b a) H_a(X)

Ex: Let X = 1 with probability P, and X = 0 with probability 1 - P. Then

    H(X) = -P \log P - (1-P) \log(1-P)  \;\overset{def}{=}\;  H_2(P)

[Figure: plot of the binary entropy function H_2(P) versus P, for 0 ≤ P ≤ 1]

1) H(X) = 1 bit when P = 1/2
2) H(X) is a concave function of P
3) H(X) = 0 if P = 0 or 1
4) max H(X) occurs when P = 1/2
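A small sketch (not from the slides) evaluating the binary entropy function; the sample arguments are arbitrary.

```python
import math

def binary_entropy(p):
    """H_2(p) = -p log2 p - (1-p) log2(1-p), with 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit, the maximum
print(binary_entropy(0.11))  # roughly 0.5 bits
```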


Joint Entropy and Conditional Entropy

Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

    H(X,Y) = - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y)

or

    H(X,Y) = - E \log p(X,Y)

Definition: The conditional entropy H(Y|X) is defined as

    H(Y|X) = \sum_{x \in X} p(x) H(Y|X=x)
           = - \sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log p(y|x)
           = - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(y|x)
           = - E_{p(x,y)} \log p(Y|X)
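As an illustrative sketch (not from the slides), both quantities can be computed from a joint table; the example distribution below is made up.

```python
import math

def joint_entropy(pxy):
    """H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y)."""
    return -sum(p * math.log2(p) for row in pxy for p in row if p > 0)

def conditional_entropy(pxy):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x); rows of pxy are indexed by x."""
    px = [sum(row) for row in pxy]
    return -sum(p * math.log2(p / px[i])
                for i, row in enumerate(pxy) for p in row if p > 0)

pxy = [[1/8, 1/8], [1/4, 1/2]]   # hypothetical joint distribution p(x, y)
print(joint_entropy(pxy), conditional_entropy(pxy))
```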


Theorem 1.1 (Chain Rule):  H(X,Y) = H(X) + H(Y|X)

pf:

    H(X,Y) = - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y)
           = - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log [ p(x) p(y|x) ]
           = - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x) - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(y|x)
           = - \sum_{x \in X} p(x) \log p(x) - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(y|x)
           = H(X) + H(Y|X)

or equivalently, we can write

    \log p(X,Y) = \log p(X) + \log p(Y|X)


Corollary:

    H(X, Y|Z) = H(X|Z) + H(Y|X, Z)

Remark:

    (i)  H(Y|X) ≠ H(X|Y)
    (ii) H(X) - H(X|Y) = H(Y) - H(Y|X)


Relative Entropy and Mutual Information

The entropy of a random variable is a measure of the uncertainty of the random variable; it is a measure of the amount of information required on the average to describe the random variable.

The relative entropy is a measure of the distance between two distributions. In statistics, it arises as an expected logarithm of the likelihood ratio. The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p.


Ex: If we knew the true distribution of the r.v., then we could construct a code with average description length H(p). If, instead, we used the code for a distribution q, we would need H(p) + D(p||q) bits on the average to describe the r.v.
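Not from the slides: a numerical sketch of this coding interpretation using idealized (non-integer) codeword lengths \log 1/q(x), for which the expected length under p is exactly H(p) + D(p||q). The two distributions below are made up.

```python
import math

p = [0.5, 0.25, 0.125, 0.125]   # true distribution (hypothetical)
q = [0.25, 0.25, 0.25, 0.25]    # assumed distribution used to build the code

H_p = -sum(pi * math.log2(pi) for pi in p)
D_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
# Ideal codeword lengths log2(1/q(x)); expected length under the true distribution p:
avg_len = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))

print(avg_len, H_p + D_pq)  # the two quantities coincide
```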


Definition:

The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

    D(p||q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}
            = E_p \log \frac{p(X)}{q(X)}
            = E_p \log \frac{1}{q(X)} - E_p \log \frac{1}{p(X)}
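A minimal sketch of the definition (not from the slides); the example arguments are arbitrary.

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2 (p(x)/q(x)); requires q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0 when p = q
```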


Definition:

Consider two r.v.s X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y), i.e.,

    I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
           = D( p(x,y) \,||\, p(x) p(y) )
           = E_{p(x,y)} \log \frac{p(X,Y)}{p(X) p(Y)}
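A sketch (not from the slides) computing mutual information from a joint table; the two example tables are made up, one independent and one correlated.

```python
import math

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ] from a joint table."""
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(pxy)
               for j, p in enumerate(row) if p > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]
correlated  = [[0.45, 0.05], [0.05, 0.45]]
print(mutual_information(independent))  # 0 bits
print(mutual_information(correlated))   # positive
```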


Ex: Let X = {0, 1} and consider two distributions p and q on X. Let p(0) = 1-r, p(1) = r, and let q(0) = 1-s, q(1) = s. Then

    D(p||q) = p(0) \log \frac{p(0)}{q(0)} + p(1) \log \frac{p(1)}{q(1)}
            = (1-r) \log \frac{1-r}{1-s} + r \log \frac{r}{s}

and

    D(q||p) = q(0) \log \frac{q(0)}{p(0)} + q(1) \log \frac{q(1)}{p(1)}
            = (1-s) \log \frac{1-s}{1-r} + s \log \frac{s}{r}

If r = s, then D(p||q) = D(q||p) = 0, while, in general,

    D(p||q) ≠ D(q||p)
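Not from the slides: a quick numerical check of the asymmetry for one arbitrary choice of r and s.

```python
import math

def kl2(p, q):
    """Binary KL divergence in bits between (1-p, p) and (1-q, q)."""
    return (1 - p) * math.log2((1 - p) / (1 - q)) + p * math.log2(p / q)

r, s = 0.5, 0.25    # arbitrary example values
print(kl2(r, s))    # D(p||q)
print(kl2(s, r))    # D(q||p) -- generally different, illustrating the asymmetry
```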


Relationship between Entropy and Mutual Information

Rewrite I(X;Y) as

    I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
           = \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)}
           = - \sum_{x,y} p(x,y) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y)
           = - \sum_{x} p(x) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y)
           = H(X) - H(X|Y)


Thus the mutual information I(X;Y) is the reduction in the uncertainty of X due to the knowledge of Y.

By symmetry, it follows that

    I(X;Y) = H(Y) - H(Y|X)

X says as much about Y as Y says about X.

Since H(X,Y) = H(X) + H(Y|X),

    I(X;Y) = H(X) + H(Y) - H(X,Y)

Also,

    I(X;X) = H(X) - H(X|X) = H(X)

The mutual information of a r.v. with itself is the entropy of the r.v.; entropy is also called self-information.


Theorem (Mutual information and entropy):

    i.   I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)
    ii.  I(X;Y) = I(Y;X)
    iii. I(X;X) = H(X)

[Figure: Venn diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X) and I(X;Y)]
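A small numerical check (not from the slides) that the two expressions in (i) agree; the joint distribution is made up.

```python
import math

def H(probs):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

pxy = [[0.3, 0.2], [0.1, 0.4]]                 # hypothetical joint p(x, y)
px = [sum(row) for row in pxy]
py = [sum(col) for col in zip(*pxy)]
Hxy = H([p for row in pxy for p in row])
I = H(px) + H(py) - Hxy                        # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(I, H(px) - (Hxy - H(py)))                # also equals H(X) - H(X|Y)
```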


Chain Rules for Entropy, Relative Entropy and Mutual Information

Theorem (Chain rule for entropy):

Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

    H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)


Proof:

(1) By repeated application of the two-variable chain rule:

    H(X_1, X_2) = H(X_1) + H(X_2 | X_1)
    H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1)
                     = H(X_1) + H(X_2 | X_1) + H(X_3 | X_2, X_1)
    ...
    H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2 | X_1) + \cdots + H(X_n | X_{n-1}, \ldots, X_1)
                             = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)


(2) We write

    p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{i-1}, \ldots, x_1)

then

    H(X_1, X_2, \ldots, X_n)
      = - \sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)
      = - \sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) \log \prod_{i=1}^{n} p(x_i \mid x_{i-1}, \ldots, x_1)
      = - \sum_{x_1, \ldots, x_n} \sum_{i=1}^{n} p(x_1, \ldots, x_n) \log p(x_i \mid x_{i-1}, \ldots, x_1)
      = - \sum_{i=1}^{n} \sum_{x_1, \ldots, x_i} p(x_1, \ldots, x_i) \log p(x_i \mid x_{i-1}, \ldots, x_1)
      = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)


Definition:

The conditional mutual information of r.v.s X and Y given Z is defined by

    I(X;Y|Z) = H(X|Z) - H(X|Y,Z)
             = E_{p(x,y,z)} \log \frac{p(X,Y|Z)}{p(X|Z)\, p(Y|Z)}


Theorem (Chain rule for mutual information):

    I(X_1, X_2, \ldots, X_n ; Y) = \sum_{i=1}^{n} I(X_i ; Y \mid X_{i-1}, \ldots, X_1)

proof:

    I(X_1, X_2, \ldots, X_n ; Y)
      = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n \mid Y)
      = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) - \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1, Y)
      = \sum_{i=1}^{n} I(X_i ; Y \mid X_1, X_2, \ldots, X_{i-1})


Definition:

The conditional relative entropy D( p(y|x) || q(y|x) ) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x):

    D( p(y|x) \,||\, q(y|x) ) = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)}
                              = E_{p(x,y)} \log \frac{p(Y|X)}{q(Y|X)}

Theorem (Chain rule for relative entropy):

    D( p(x,y) \,||\, q(x,y) ) = D( p(x) \,||\, q(x) ) + D( p(y|x) \,||\, q(y|x) )


Jensen's Inequality and Its Consequences

Definition: A function f is said to be convex over an interval (a,b) if for every x_1, x_2 ∈ (a,b) and 0 ≤ λ ≤ 1,

    f(λ x_1 + (1-λ) x_2) ≤ λ f(x_1) + (1-λ) f(x_2)

A function f is said to be strictly convex if equality holds only when λ = 0 or λ = 1.

Definition: A function f is concave if -f is convex.

Ex: convex functions: x^2, |x|, e^x, x \log x (for x ≥ 0)
    concave functions: \log x, x^{1/2} for x ≥ 0
    both convex and concave: ax + b; linear functions


Theorem:

If the function f has a second derivative which is non-negative (positive) everywhere, then the function is convex (strictly convex).

Recall the expected value of X:

    EX = \sum_{x \in X} x\, p(x)        (discrete case)
    EX = \int x\, p(x)\, dx             (continuous case)


Theorem (Jensen's inequality): If f(x) is a convex function and X is a random variable, then

    E f(X) ≥ f(EX)

Proof: For a two-mass-point distribution, the inequality becomes

    p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2),    p_1 + p_2 = 1,

which follows directly from the definition of convex functions. Suppose the theorem is true for distributions with k-1 mass points. Then writing p_i' = p_i / (1 - p_k) for i = 1, 2, ..., k-1, we have

    \sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i)
                              ≥ p_k f(x_k) + (1 - p_k) f\Big( \sum_{i=1}^{k-1} p_i' x_i \Big)
                              ≥ f\Big( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \Big)
                              = f\Big( \sum_{i=1}^{k} p_i x_i \Big)

(Mathematical induction: the first inequality uses the induction hypothesis, the second the definition of convexity.)

The proof can be extended to continuous distributions by continuity arguments.
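Not from the slides: a quick numerical illustration of Jensen's inequality with the convex function f(x) = x^2 and a made-up distribution.

```python
# Jensen's inequality, E f(X) >= f(EX), for the convex function f(x) = x^2.
xs = [1.0, 2.0, 6.0]          # hypothetical support
ps = [0.5, 0.3, 0.2]          # hypothetical probabilities

EX  = sum(p * x for p, x in zip(ps, xs))
EfX = sum(p * x**2 for p, x in zip(ps, xs))
print(EfX, EX**2, EfX >= EX**2)   # the inequality holds
```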


Theorem (Information inequality):

Let p(x), q(x), x ∈ X, be two probability mass functions. Then

    D(p||q) ≥ 0

with equality iff p(x) = q(x) for all x.

Proof: Let A = {x : p(x) > 0} be the support set of p(x). Then

    -D(p||q) = - \sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}
             = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)}
             ≤ \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)}        (Jensen, since \log t is concave)
             = \log \sum_{x \in A} q(x)
             ≤ \log \sum_{x \in X} q(x)
             = \log 1 = 0


Corollary (Non-negativity of mutual information):

For any two r.v.s X, Y,

    I(X;Y) ≥ 0

with equality iff X and Y are independent.

Proof: I(X;Y) = D( p(x,y) || p(x)p(y) ) ≥ 0, with equality iff p(x,y) = p(x)p(y), i.e., X and Y are independent.

Corollary: D( p(y|x) || q(y|x) ) ≥ 0, with equality iff p(y|x) = q(y|x) for all x and y with p(x) > 0.

Corollary: I(X;Y|Z) ≥ 0, with equality iff X and Y are conditionally independent given Z.


Theorem: H(X) ≤ \log|X|, where |X| denotes the number of elements in the range of X, with equality iff X has a uniform distribution over X.

Proof: Let u(x) = 1/|X| be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then

    D(p||u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log|X| - H(X)

Hence, by the non-negativity of relative entropy,

    0 ≤ D(p||u) = \log|X| - H(X)


Theorem (Conditioning reduces entropy):

    H(X|Y) ≤ H(X)

with equality iff X and Y are independent.

Proof: 0 ≤ I(X;Y) = H(X) - H(X|Y)

Note that this is true only on the average; specifically, H(X|Y=y) may be greater than, less than, or equal to H(X), but on the average

    H(X|Y) = \sum_{y} p(y) H(X|Y=y) ≤ H(X).


Ex: Let (X,Y) have the following joint distribution p(x,y):

              X=1     X=2
    Y=1        0      3/4
    Y=2       1/8     1/8

Then H(X) = H(1/8, 7/8) = 0.544 bits,

    H(X|Y=1) = 0 bits
    H(X|Y=2) = 1 bit  > H(X)

However, H(X|Y) = 3/4 H(X|Y=1) + 1/4 H(X|Y=2) = 0.25 bits < H(X).
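A sketch (not from the slides) that recomputes the numbers above directly from the joint table.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint table p(x, y); keys are (x, y) with x, y in {1, 2}.
pxy = {(1, 1): 0.0, (2, 1): 3/4, (1, 2): 1/8, (2, 2): 1/8}

px = {x: sum(v for (xx, _), v in pxy.items() if xx == x) for x in (1, 2)}
py = {y: sum(v for (_, yy), v in pxy.items() if yy == y) for y in (1, 2)}

print(H(px.values()))                                   # H(X) ~ 0.544 bits
for y in (1, 2):
    print(y, H([pxy[(x, y)] / py[y] for x in (1, 2)]))  # H(X|Y=1)=0, H(X|Y=2)=1
print(sum(py[y] * H([pxy[(x, y)] / py[y] for x in (1, 2)]) for y in (1, 2)))  # H(X|Y)=0.25
```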


Theorem (Independence bound on entropy):

Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then

    H(X_1, X_2, \ldots, X_n) ≤ \sum_{i=1}^{n} H(X_i)

with equality iff the X_i are independent.

Proof: By the chain rule for entropies,

    H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) ≤ \sum_{i=1}^{n} H(X_i)

(conditioning reduces entropy), with equality iff the X_i's are independent.



The LOG SUM INEQUALITY AND ITS APPLICATIONS

Theorem (Log sum inequality):

For non-negative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

    \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \;\geq\; \Big( \sum_{i=1}^{n} a_i \Big) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}

with equality iff a_i / b_i = constant.

Some conventions: 0 \log 0 = 0,  a \log \frac{a}{0} = \infty if a > 0,  and 0 \log \frac{0}{0} = 0.
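Not from the slides: a quick numerical check of the inequality and of the equality case; the numbers are arbitrary.

```python
import math

def log_sum_lhs(a, b):
    """Left side: sum_i a_i log2(a_i / b_i)."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def log_sum_rhs(a, b):
    """Right side: (sum a_i) log2(sum a_i / sum b_i)."""
    return sum(a) * math.log2(sum(a) / sum(b))

a, b = [1.0, 2.0, 3.0], [2.0, 1.0, 4.0]        # arbitrary non-negative numbers
print(log_sum_lhs(a, b) >= log_sum_rhs(a, b))  # True

b_prop = [2 * ai for ai in a]                  # a_i / b_i constant -> equality
print(log_sum_lhs(a, b_prop), log_sum_rhs(a, b_prop))
```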


Proof: Assume w.l.o.g. that a_i > 0 and b_i > 0. The function f(t) = t \log t is strictly convex, since

    f''(t) = \frac{1}{t} \log e > 0

for all positive t. Hence by Jensen's inequality, for α_i ≥ 0 with \sum_i α_i = 1,

    \sum_i α_i f(t_i) \;\geq\; f\Big( \sum_i α_i t_i \Big)

Setting α_i = b_i / \sum_j b_j and t_i = a_i / b_i, we obtain

    \sum_i \frac{a_i}{\sum_j b_j} \log \frac{a_i}{b_i} \;\geq\; \Big( \sum_i \frac{a_i}{\sum_j b_j} \Big) \log \Big( \sum_i \frac{a_i}{\sum_j b_j} \Big)

(note that \sum_i α_i t_i = \sum_i a_i / \sum_j b_j ≥ 0). Multiplying both sides by \sum_j b_j gives

    \sum_i a_i \log \frac{a_i}{b_i} \;\geq\; \Big( \sum_i a_i \Big) \log \frac{\sum_i a_i}{\sum_i b_i}

which is the log sum inequality.


Reproving the theorem that D(p||q) ≥ 0, with equality iff p(x) = q(x):

    D(p||q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
            \;\geq\; \Big( \sum_{x} p(x) \Big) \log \frac{\sum_x p(x)}{\sum_x q(x)}        (from the log-sum inequality)
            = 1 \cdot \log \frac{1}{1} = 0

with equality iff p(x)/q(x) = c. Since both p and q are probability mass functions, c = 1, i.e., p(x) = q(x) for all x.


Theorem: D(p||q) is convex in the pair (p,q); i.e., if (p_1, q_1) and (p_2, q_2) are two pairs of probability mass functions, then

    D( λ p_1 + (1-λ) p_2 \,||\, λ q_1 + (1-λ) q_2 ) \;\leq\; λ D(p_1 || q_1) + (1-λ) D(p_2 || q_2)

for all 0 ≤ λ ≤ 1.

Proof: Apply the log-sum inequality to a single term on the left-hand side. With a_1 = λ p_1(x), a_2 = (1-λ) p_2(x), b_1 = λ q_1(x), b_2 = (1-λ) q_2(x) in

    (a_1 + a_2) \log \frac{a_1 + a_2}{b_1 + b_2} \;\leq\; a_1 \log \frac{a_1}{b_1} + a_2 \log \frac{a_2}{b_2},

we get

    ( λ p_1(x) + (1-λ) p_2(x) ) \log \frac{λ p_1(x) + (1-λ) p_2(x)}{λ q_1(x) + (1-λ) q_2(x)}
        \;\leq\; λ p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-λ) p_2(x) \log \frac{p_2(x)}{q_2(x)}.

Summing this over all x, we obtain

    D( λ p_1 + (1-λ) p_2 \,||\, λ q_1 + (1-λ) q_2 ) \;\leq\; λ D(p_1 || q_1) + (1-λ) D(p_2 || q_2).


Theorem (Concavity of entropy):

H(p) is a concave function of p. That is,

    H( λ p_1 + (1-λ) p_2 ) \;\geq\; λ H(p_1) + (1-λ) H(p_2)

Proof: H(p) = \log|X| - D(p||u), where u is the uniform distribution on |X| outcomes. The concavity of H then follows directly from the convexity of D.


Theorem: Let (X,Y) ~ p(x,y) = p(x) p(y|x). The mutual information I(X;Y) is

    (i)  a concave function of p(x) for fixed p(y|x), and
    (ii) a convex function of p(y|x) for fixed p(x).

Proof:

(1) I(X;Y) = H(Y) - H(Y|X) = H(Y) - \sum_x p(x) H(Y|X=x)    (*)

If p(y|x) is fixed, then p(y) is a linear function of p(x)  ( p(y) = \sum_x p(x,y) = \sum_x p(x) p(y|x) ). Hence H(Y), which is a concave function of p(y), is a concave function of p(x). The second term of (*) is a linear function of p(x). Hence the difference is a concave function of p(x).


(2) We fix p(x) and consider two different conditional distributions p_1(y|x) and p_2(y|x). The corresponding joint distributions are p_1(x,y) = p(x) p_1(y|x) and p_2(x,y) = p(x) p_2(y|x), and their respective marginals are p(x), p_1(y) and p(x), p_2(y).

Consider the conditional distribution

    p_λ(y|x) = λ p_1(y|x) + (1-λ) p_2(y|x),

a mixture of p_1(y|x) and p_2(y|x). Since p(x) is fixed, p_λ(x,y) is linear in the p_i(y|x): the corresponding joint distribution is also a mixture of the corresponding joint distributions,

    p_λ(x,y) = λ p_1(x,y) + (1-λ) p_2(x,y),

and the distribution of Y is also a mixture, p_λ(y) = λ p_1(y) + (1-λ) p_2(y). Hence, if we let q_λ(x,y) = p(x) p_λ(y) (the product of the marginal distributions), then q_λ(x,y) = λ q_1(x,y) + (1-λ) q_2(x,y) is also linear in the p_i(y|x) when p(x) is fixed.

Since I(X;Y) = D(p_λ || q_λ) and D(p||q) is convex in the pair (p,q), the mutual information is a convex function of the conditional distribution p(y|x) when p(x) is fixed.


Data Processing Inequality:

No clever manipulation of the data can improve the inferences that can be made from the data.

Definition: R.v.s X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. That is, if X → Y → Z forms a Markov chain, then

    (i)  p(x,y,z) = p(x) p(y|x) p(z|y)
    (ii) p(x,z|y) = p(x|y) p(z|y) : X and Z are conditionally independent given Y

X → Y → Z implies that Z → Y → X. If Z = f(Y), then X → Y → Z.


Theorem (Data processing inequality):

If X → Y → Z, then I(X;Y) ≥ I(X;Z). No processing of Y, deterministic or random, can increase the information that Y contains about X.

Proof:

    I(X;Y,Z) = I(X;Z) + I(X;Y|Z)   : chain rule
             = I(X;Y) + I(X;Z|Y)   : chain rule

Since X and Z are conditionally independent given Y, we have I(X;Z|Y) = 0. Since I(X;Y|Z) ≥ 0, we have

    I(X;Y) ≥ I(X;Z)

with equality iff I(X;Y|Z) = 0, i.e., X → Z → Y forms a Markov chain. Similarly, one can prove I(Y;Z) ≥ I(X;Z).


Corollary: If X → Y → Z forms a Markov chain and Z = g(Y), we have I(X;Y) ≥ I(X;g(Y)): functions of the data Y cannot increase the information about X.

Corollary: If X → Y → Z, then I(X;Y|Z) ≤ I(X;Y).

Proof:

    I(X;Y,Z) = I(X;Z) + I(X;Y|Z)
             = I(X;Y) + I(X;Z|Y)

By Markovity, I(X;Z|Y) = 0, and I(X;Z) ≥ 0, so I(X;Y|Z) ≤ I(X;Y). The dependence of X and Y is decreased (or remains unchanged) by the observation of a downstream r.v. Z.


Note that it is possible that I(X;Y|Z) > I(X;Y) when X, Y and Z do not form a Markov chain.

Ex: Let X and Y be independent fair binary r.v.s, and let Z = X + Y. Then I(X;Y) = 0, but

    I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(X|Z) = P(Z=1) H(X|Z=1) = 1/2 bit.
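Not from the slides: a sketch that checks this value by enumerating the four equally likely (x, y) pairs; the helper function is hypothetical.

```python
import math
from collections import defaultdict

# The four equally likely (x, y) pairs, with z = x + y.
p = {(x, y, x + y): 0.25 for x in (0, 1) for y in (0, 1)}

def cond_entropy_of_x(joint, given):
    """H(X | given), where joint maps (x, y, z) -> prob and given(x, y, z) picks the conditioning value."""
    groups = defaultdict(lambda: defaultdict(float))
    for (x, y, z), pr in joint.items():
        groups[given(x, y, z)][x] += pr
    total = 0.0
    for dist in groups.values():
        pg = sum(dist.values())
        total -= sum(v * math.log2(v / pg) for v in dist.values() if v > 0)
    return total

H_x_given_z = cond_entropy_of_x(p, lambda x, y, z: z)
H_x_given_yz = cond_entropy_of_x(p, lambda x, y, z: (y, z))
print(H_x_given_z - H_x_given_yz)   # I(X;Y|Z) = 0.5 bit
```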



Fano's Inequality:

Fano's inequality relates the probability of error in guessing the r.v. X to its conditional entropy H(X|Y).

Note that the conditional entropy of a r.v. X given another random variable Y is zero iff X is a function of Y (proof: HW). H(X|Y) = 0 implies there is no uncertainty about X if we know Y: for every y with p(y) > 0, there is only one possible value of x with p(x,y) > 0. Hence we can estimate X from Y with zero probability of error iff H(X|Y) = 0.

We expect to be able to estimate X with a low probability of error only if the conditional entropy H(X|Y) is small. Fano's inequality quantifies this idea.


Suppose we wish to estimate a r.v. X with a distribution p(x). We observe a r.v. Y which is related to X by the conditional distribution p(y|x). From Y, we calculate a function X̂ = g(Y), which is an estimate of X. We wish to bound the probability that X̂ ≠ X. We observe that X → Y → X̂ forms a Markov chain.

Define the probability of error

    P_e = Pr( X̂ ≠ X ) = Pr( g(Y) ≠ X )


Theorem (Fano's inequality):

For any estimator X̂ such that X → Y → X̂, with P_e = Pr(X̂ ≠ X), we have

    H(P_e) + P_e \log(|X| - 1) \;\geq\; H(X|Y)

This inequality can be weakened (using H(P_e) ≤ 1, since the error indicator E is a binary r.v., and \log(|X|-1) ≤ \log|X|) to

    1 + P_e \log|X| \;\geq\; H(X|Y)

or

    P_e \;\geq\; \frac{H(X|Y) - 1}{\log|X|}

Remark: P_e = 0 implies H(X|Y) = 0.
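Not from the slides: a numerical sketch of the bound for one made-up joint distribution, using the MAP estimator x̂(y) = argmax_x p(x|y) (the choice of estimator here is an illustrative assumption).

```python
import math

# Hypothetical joint distribution p(x, y); rows are x in {0,1,2}, columns are y in {0,1}.
pxy = [[0.30, 0.05],
       [0.05, 0.25],
       [0.15, 0.20]]

py = [sum(col) for col in zip(*pxy)]

# H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y)
H_x_given_y = -sum(p * math.log2(p / py[j])
                   for row in pxy for j, p in enumerate(row) if p > 0)

# Probability of error of the MAP estimator: Pe = sum_y [ p(y) - max_x p(x, y) ]
pe = sum(py[j] - max(row[j] for row in pxy) for j in range(len(py)))

def h2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Fano: H(Pe) + Pe * log2(|X| - 1) >= H(X|Y)
print(h2(pe) + pe * math.log2(len(pxy) - 1), ">=", H_x_given_y)
```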


Proof: Define an error r.v.

    E = 1 if X̂ ≠ X,    E = 0 if X̂ = X.

By the chain rule for entropies, we can expand H(E, X | X̂) in two ways:

    H(E, X | X̂) = H(X | X̂) + H(E | X, X̂)
                = H(E | X̂) + H(X | E, X̂)

Since E is a function of X and X̂, H(E | X, X̂) = 0. Since conditioning reduces entropy, and E is a binary-valued r.v., H(E | X̂) ≤ H(E) = H(P_e). The remaining term, H(X | E, X̂), can be bounded as follows:

    H(X | E, X̂) = Pr(E=0) H(X | X̂, E=0) + Pr(E=1) H(X | X̂, E=1)
                ≤ (1 - P_e) · 0 + P_e \log(|X| - 1)

Combining these bounds, H(X | X̂) ≤ H(P_e) + P_e \log(|X| - 1).


Here we used that, given E = 0, X = X̂, and that, given E = 1, we can upper-bound the conditional entropy by the log of the number of remaining outcomes (|X| - 1). Hence

    H(P_e) + P_e \log(|X| - 1) \;\geq\; H(X | X̂).

By the data processing inequality, we have I(X; X̂) ≤ I(X;Y) since X → Y → X̂, and therefore H(X | X̂) ≥ H(X|Y). Thus we have

    H(P_e) + P_e \log(|X| - 1) \;\geq\; H(X | X̂) \;\geq\; H(X|Y).

Remark: Suppose there is no knowledge of Y. Thus X must be guessed without any information. Let X ∈ {1, 2, ..., m} and P_1 ≥ P_2 ≥ ... ≥ P_m. Then the best guess of X is X̂ = 1 and the resulting probability of error is P_e = 1 - P_1. Fano's inequality becomes

    H(P_e) + P_e \log(m-1) \;\geq\; H(X)

The probability mass function

    (P_1, P_2, \ldots, P_m) = \Big(1 - P_e, \frac{P_e}{m-1}, \ldots, \frac{P_e}{m-1}\Big)

achieves this bound with equality.



Some Properties of the Relative Entropy

1. Let μ_n and μ'_n be two probability distributions on the state space of a Markov chain at time n, and let μ_{n+1} and μ'_{n+1} be the corresponding distributions at time n+1. Let the corresponding joint mass functions be denoted by p and q. That is,

    p(x_n, x_{n+1}) = p(x_n) r(x_{n+1} | x_n)
    q(x_n, x_{n+1}) = q(x_n) r(x_{n+1} | x_n)

where r(· | ·) is the probability transition function for the Markov chain.


Then by the chain rule for relative entropy, we have the following two expansions:

    D( p(x_n, x_{n+1}) || q(x_n, x_{n+1}) )
      = D( p(x_n) || q(x_n) ) + D( p(x_{n+1}|x_n) || q(x_{n+1}|x_n) )
      = D( p(x_{n+1}) || q(x_{n+1}) ) + D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) )

Since both p and q are derived from the same Markov chain,

    p(x_{n+1}|x_n) = q(x_{n+1}|x_n) = r(x_{n+1}|x_n),

and hence

    D( p(x_{n+1}|x_n) || q(x_{n+1}|x_n) ) = 0


That is,

    D( p(x_n) || q(x_n) ) = D( p(x_{n+1}) || q(x_{n+1}) ) + D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) )

Since D( p(x_n|x_{n+1}) || q(x_n|x_{n+1}) ) ≥ 0,

    D( p(x_n) || q(x_n) ) \;\geq\; D( p(x_{n+1}) || q(x_{n+1}) )

or  D(μ_n || μ'_n) ≥ D(μ_{n+1} || μ'_{n+1}).

Conclusion: The distance between the probability mass functions is decreasing with time n for any Markov chain.


2. The relative entropy D(μ_n || μ) between a distribution μ_n on the states at time n and a stationary distribution μ decreases with n.

In the last equation, if we let μ'_n be any stationary distribution μ, then μ'_{n+1} is the same stationary distribution. Hence

    D(μ_n || μ) \;\geq\; D(μ_{n+1} || μ)

Any state distribution gets closer and closer to each stationary distribution as time passes:

    \lim_{n \to \infty} D(μ_n || μ) = 0
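Not from the slides: a small simulation of this monotone decrease for a two-state chain; the transition matrix, its stationary distribution, and the starting distribution are all made up for illustration.

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def step(mu, P):
    """One step of the chain: mu_{n+1}[j] = sum_i mu_n[i] * P[i][j]."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]

P = [[0.9, 0.1],
     [0.3, 0.7]]           # hypothetical transition matrix
stationary = [0.75, 0.25]  # solves mu P = mu for this P

mu = [1.0, 0.0]            # arbitrary initial distribution
for n in range(6):
    print(n, round(kl(mu, stationary), 4))   # monotonically non-increasing
    mu = step(mu, P)
```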


3. Def: A probability transition matrix [P_ij], P_ij = Pr{X_{n+1} = j | X_n = i}, is called doubly stochastic if

    \sum_{i} P_{ij} = 1 for all j,    and    \sum_{j} P_{ij} = 1 for all i.

The uniform distribution is a stationary distribution of P iff the probability transition matrix is doubly stochastic.


4. The conditional entropy H(X_n|X_1) increases with n for a stationary Markov process.

If the Markov process is stationary, then H(X_n) is constant, so the entropy is non-increasing. However, it can be proved that H(X_n|X_1) increases with n. This implies that the conditional uncertainty of the future increases.

Proof:

    H(X_n|X_1) ≥ H(X_n|X_1, X_2)    (conditioning reduces entropy)
               = H(X_n|X_2)         (by Markovity)
               = H(X_{n-1}|X_1)     (by stationarity)

Similarly, H(X_0|X_n) is increasing in n for any Markov chain.



Sufficient Statistics

Suppose we have a family of probability mass functions {f_θ(x)} indexed by θ, and let X be a sample from a distribution in this family. Let T(X) be any statistic (function of the sample), like the sample mean or sample variance. Then

    θ → X → T(X),

and by the data processing inequality, we have

    I(θ; T(X)) ≤ I(θ; X)

for any distribution on θ. However, if equality holds, no information is lost.

A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.


Def:

A function T(X) is said to be a sufficient statistic relative to the family {f_θ(x)} if X is independent of θ given T(X), i.e., θ → T(X) → X forms a Markov chain.

Equivalently,

    I(θ; X) = I(θ; T(X))

for all distributions on θ. Sufficient statistics preserve mutual information.



Some examples of Sufficient Statistics

1. Let X_1, X_2, ..., X_n, X_i ∈ {0,1}, be an i.i.d. sequence of coin tosses of a coin with unknown parameter θ = Pr(X_i = 1). Given n, the number of 1's,

    T(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i,

is a sufficient statistic for θ. Given T, all sequences having that many 1's are equally likely and independent of the parameter θ.


Specifically,

    \Pr\Big\{ (X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n) \,\Big|\, \sum_{i=1}^{n} X_i = k \Big\}
        = \begin{cases} 1 / \binom{n}{k} & \text{if } \sum_i x_i = k \\ 0 & \text{otherwise} \end{cases}

Thus θ → \sum_i X_i → (X_1, X_2, \ldots, X_n), and T is a sufficient statistic for θ.


2. If X is normally distributed with mean θ and variance 1, that is, if

    f_θ(x) = \frac{1}{\sqrt{2\pi}} e^{-(x-\theta)^2 / 2} = N(\theta, 1),

and X_1, X_2, ..., X_n are drawn independently according to f_θ, then a sufficient statistic for θ is the sample mean

    \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.

It can be verified that the conditional distribution P(X_1, X_2, \ldots, X_n \mid \bar{X}_n, \theta) is independent of θ.


The minimal sufficient statistic is a sufficient statistic that is a function of all other sufficient statistics.

Def: A statistic T(X) is a minimal sufficient statistic relative to {f_θ(x)} if it is a function of every other sufficient statistic U:

    θ → T(X) → U(X) → X.

Hence, a minimal sufficient statistic maximally compresses the information about θ in the sample. Other sufficient statistics may contain additional irrelevant information.

The sufficient statistics of the above examples are minimal.


Shuffles increase Entropy:

If T is a shuffle (permutation) of a deck of cards and X is the initial (random) position of the cards in the deck, and if the choice of the shuffle T is independent of X, then

    H(TX) ≥ H(X)

where TX is the permutation of the deck induced by the shuffle T on the initial permutation X.

Proof:

    H(TX) ≥ H(TX|T)
          = H(T^{-1} T X | T)    (why?)
          = H(X|T)
          = H(X)                 (since X and T are independent)


If X and X' are i.i.d. with entropy H(X), then

    Pr(X = X') ≥ 2^{-H(X)}

with equality iff X has a uniform distribution.

pf: Suppose X ~ p(x). By Jensen's inequality (notice that the function f(y) = 2^y is convex), we have

    2^{E \log p(X)} ≤ E\, 2^{\log p(X)},

which implies that

    2^{-H(X)} = 2^{\sum_x p(x) \log p(x)} ≤ \sum_x p(x)\, 2^{\log p(x)} = \sum_x p^2(x) = \Pr(X = X').

(Let X and X' be two i.i.d. r.v.s with entropy H(X). The probability that X = X' is given by \Pr(X = X') = \sum_x p^2(x).)

Let X and X' be independent with X ~ p(x), X' ~ r(x), x, x' ∈ X. Then

    \Pr(X = X') ≥ 2^{-H(p) - D(p||r)}
    \Pr(X = X') ≥ 2^{-H(r) - D(r||p)}

pf:

    2^{-H(p) - D(p||r)} = 2^{\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{r(x)}{p(x)}}
                        = 2^{\sum_x p(x) \log r(x)}
                        ≤ \sum_x p(x)\, 2^{\log r(x)}
                        = \sum_x p(x) r(x) = \Pr(X = X').
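Not from the slides: a quick numerical check of the collision-probability bound; the distributions are illustrative.

```python
import math

# Collision probability vs. 2^{-H(X)} for an arbitrary distribution.
p = [0.5, 0.25, 0.125, 0.125]
H = -sum(pi * math.log2(pi) for pi in p)
collision = sum(pi ** 2 for pi in p)          # Pr(X = X') for i.i.d. X, X' ~ p
print(collision, 2 ** (-H), collision >= 2 ** (-H))

# Equality holds for the uniform distribution:
u = [0.25] * 4
print(sum(ui ** 2 for ui in u), 2 ** (-2.0))  # both equal 0.25
```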