
  • Lecture 6: Entropy Rate

    • Entropy rate H(X)

    • Random walk on graph

    Dr. Yao Xie, ECE587, Information Theory, Duke University

  • Coin tossing versus poker

    • Toss a fair coin and see a sequence

    Head, Tail, Tail, Head · · ·

    $p(x_1, x_2, \ldots, x_n) \approx 2^{-nH(X)}$

    • Play card games with a friend and see a sequence

    A ♣

    K ♥

    Q ♦

    J ♠

    10 ♣ · · ·

    $p(x_1, x_2, \ldots, x_n) \approx\ ?$


  • How to model dependence: Markov chain

    • A stochastic process X1, X2, · · ·

    – States {X1, . . . , Xn}, each state Xi ∈ X
    – Next step only depends on the previous state

    p(xn+1|xn, . . . , x1) = p(xn+1|xn).

    – Transition probability

    $P_{ij}$: the transition probability of $i \to j$

    – $p(x_{n+1}) = \sum_{x_n} p(x_n)\, p(x_{n+1}|x_n)$
    – $p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2|x_1) \cdots p(x_n|x_{n-1})$ (a short numerical sketch follows)

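A minimal sketch (not from the lecture) of the factorization above: evaluating p(x1, . . . , xn) for a two-state chain. The transition matrix and initial distribution are made-up numbers.

```python
import numpy as np

P = np.array([[0.9, 0.1],    # hypothetical transitions: P[i, j] = p(j | i)
              [0.3, 0.7]])
p1 = np.array([0.5, 0.5])    # hypothetical initial distribution p(x1)

def joint_prob(states, P, p1):
    """p(x1, ..., xn) = p(x1) * prod_i p(x_{i+1} | x_i)."""
    prob = p1[states[0]]
    for a, b in zip(states, states[1:]):
        prob *= P[a, b]
    return prob

print(joint_prob([0, 0, 1, 1], P, p1))  # 0.5 * 0.9 * 0.1 * 0.7 = 0.0315
```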

  • Hidden Markov model (HMM)

    • Used extensively in speech recognition, handwriting recognition, and machine learning.

    • Markov process X1, X2, . . . , Xn, which is unobservable

    • Observe a random process Y1,Y2, . . . , Yn, such that

    Yi ∼ p(yi|xi)

    • We can build a probability model

    $p(x^n, y^n) = p(x_1) \prod_{i=1}^{n-1} p(x_{i+1}|x_i) \prod_{i=1}^{n} p(y_i|x_i)$

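A sketch of this joint probability, continuing the hypothetical chain above with a made-up emission matrix; `E[x, y]` stands in for p(y|x).

```python
import numpy as np

P = np.array([[0.9, 0.1], [0.3, 0.7]])  # hidden-state transitions (hypothetical)
E = np.array([[0.8, 0.2], [0.4, 0.6]])  # emissions: E[x, y] = p(y | x) (hypothetical)
p1 = np.array([0.5, 0.5])               # initial distribution of X1

def hmm_joint_prob(xs, ys, P, E, p1):
    """p(x^n, y^n) = p(x1) * prod p(x_{i+1}|x_i) * prod p(y_i|x_i)."""
    prob = p1[xs[0]]
    for a, b in zip(xs, xs[1:]):
        prob *= P[a, b]
    for x, y in zip(xs, ys):
        prob *= E[x, y]
    return prob

print(hmm_joint_prob([0, 0, 1], [0, 1, 1], P, E, p1))  # 0.5*0.9*0.1 * 0.8*0.2*0.6
```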

  • Time-invariant Markov chain

    • A Markov chain is time invariant if the conditional probability p(xn|xn−1) does not depend on n

    p(Xn+1 = b|Xn = a) = p(X2 = b|X1 = a), for all a, b ∈ X

    • For this kind of Markov chain, define the transition matrix

    $P = \begin{pmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{pmatrix}$


  • Simple weather model

    • X = {Sunny: S, Rainy: R}

    • p(S|S) = 1 − β, p(R|R) = 1 − α, p(R|S) = β, p(S|R) = α

    $P = \begin{pmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{pmatrix}$

    !"

    #"

    $%!" $%#"

    "


  • Probability of seeing a sequence SSRR:

    p(SSRR) = p(S)p(S|S)p(R|S)p(R|R) = p(S)(1 − β)β(1 − α)

    How will such sequences behave after many days of observations?

    • What sequences of observations are more typical?

    • What is the probability of seeing a typical sequence?

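A numeric illustration with made-up α and β, taking p(S) to be the stationary probability µ(S) = α/(α + β) introduced on the next slide:

```python
# p(SSRR) = p(S) * (1 - beta) * beta * (1 - alpha), hypothetical parameters
alpha, beta = 0.4, 0.2
pS = alpha / (alpha + beta)             # stationary probability of Sunny
p_SSRR = pS * (1 - beta) * beta * (1 - alpha)
print(p_SSRR)                           # (2/3) * 0.8 * 0.2 * 0.6 = 0.064
```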

  • Stationary distribution

    • Stationary distribution: a distribution µ on the states such that the distribution at time n + 1 is the same as the distribution at time n.

    • Our weather example:
    – If $\mu(S) = \frac{\alpha}{\alpha+\beta}$, $\mu(R) = \frac{\beta}{\alpha+\beta}$, and
    $P = \begin{pmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{pmatrix}$
    – Then
    $p(X_{n+1} = S) = p(S|S)\mu(S) + p(S|R)\mu(R) = (1-\beta)\frac{\alpha}{\alpha+\beta} + \alpha\frac{\beta}{\alpha+\beta} = \frac{\alpha}{\alpha+\beta} = \mu(S)$ (a quick numerical check follows)

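The quick numerical check of the calculation above, with hypothetical α and β:

```python
import numpy as np

alpha, beta = 0.4, 0.2                          # hypothetical parameters
P = np.array([[1 - beta, beta],
              [alpha, 1 - alpha]])              # states ordered (S, R)
mu = np.array([alpha, beta]) / (alpha + beta)   # claimed stationary distribution
print(mu @ P, mu)                               # both [0.6667, 0.3333]
```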

  • How to calculate the stationary distribution

    – Stationary distribution µi, i = 1, · · · , |X| satisfies

    $\mu_i = \sum_j \mu_j P_{ji}$ (i.e., $\mu = \mu P$), and $\sum_{i=1}^{|\mathcal{X}|} \mu_i = 1$ (a solver sketch follows below)

    – “Detailed balance”:

    !"

    #!"

    $%!"#%"

    $&!"#&"

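One standard way to solve µ = µP together with Σµi = 1 is as a linear system; a generic sketch (the matrix P here is an arbitrary example, not from the lecture):

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.4, 0.6]])                 # arbitrary example chain
n = P.shape[0]
# mu = mu P  <=>  (P^T - I) mu = 0; append 1^T mu = 1 to pin down the scale.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
mu, *_ = np.linalg.lstsq(A, b, rcond=None)
print(mu)                                  # [0.6667, 0.3333]
```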

  • Stationary process

    • A stochastic process is stationary if the joint distribution of any subset is invariant to time shifts

    p(X1 = x1, · · · , Xn = xn) = p(X2 = x1, · · · , Xn+1 = xn).

    • Example: coin tossing

    p(X1 = head, X2 = tail) = p(X2 = head, X3 = tail) = p(1 − p).


  • Entropy rate

    • When Xi are i.i.d., the entropy is $H(X^n) = H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i) = nH(X)$

    • With a dependent sequence Xi, how does $H(X^n)$ grow with n? Still linearly?

    • The entropy rate characterizes this growth rate

    • Definition 1: average entropy per symbol

    $H(X) = \lim_{n\to\infty} \frac{H(X^n)}{n}$

    • Definition 2: rate of information innovation

    $H'(X) = \lim_{n\to\infty} H(X_n|X_{n-1}, \ldots, X_1)$

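For a stationary Markov chain the two definitions can be compared directly: the chain rule gives H(X^n) = H(X1) + (n − 1)H(X2|X1), so H(X^n)/n decreases to H(X2|X1). A sketch with a hypothetical two-state chain:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P = np.array([[0.8, 0.2],
              [0.4, 0.6]])              # hypothetical transition matrix
mu = np.array([2/3, 1/3])               # its stationary distribution
H1 = H(mu)                              # H(X1) when started at stationarity
Hcond = sum(m * H(row) for m, row in zip(mu, P))   # H(X2|X1)
for n in [1, 10, 100, 1000]:
    print(n, (H1 + (n - 1) * Hcond) / n)  # H(X^n)/n, decreasing to Hcond
```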

  • H′(X) exists for stationary Xi

    $H(X_n|X_1, \ldots, X_{n-1}) \le H(X_n|X_2, \ldots, X_{n-1})$ (conditioning reduces entropy)
    $= H(X_{n-1}|X_1, \ldots, X_{n-2})$ (stationarity)

    – H(Xn|X1, · · · , Xn−1) decreases as n increases
    – It is bounded below by 0, since entropy is nonnegative
    – A decreasing sequence bounded below must have a limit


  • H(X) = H′(X) for stationary Xi

    By the chain rule, $\frac{1}{n} H(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i|X_{i-1}, \ldots, X_1)$

    • Each H(Xn|X1, · · · , Xn−1)→ H′(X)

    • Cesàro mean: if $a_n \to a$ and $b_n = \frac{1}{n} \sum_{i=1}^{n} a_i$, then $b_n \to a$

    • So $\frac{1}{n} H(X_1, \ldots, X_n) \to H'(X)$


  • AEP for stationary process

    $-\frac{1}{n} \log p(X_1, \ldots, X_n) \to H(X)$

    • $p(X_1, \ldots, X_n) \approx 2^{-nH(X)}$

    • Typical sequences form a typical set of size about $2^{nH(X)}$

    • We can use nH(X) bits to represent a typical sequence

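A Monte Carlo sanity check of the AEP on the weather chain (hypothetical α, β): sample a long path and compare −(1/n) log₂ p(X1, . . . , Xn) with the entropy rate.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.4, 0.2                               # hypothetical parameters
P = np.array([[1 - beta, beta], [alpha, 1 - alpha]]) # states: 0 = S, 1 = R
mu = np.array([alpha, beta]) / (alpha + beta)

n = 100_000
x = [rng.choice(2, p=mu)]                            # X1 from the stationary law
for _ in range(n - 1):
    x.append(rng.choice(2, p=P[x[-1]]))              # X_{i+1} ~ P[X_i, :]

log_p = np.log2(mu[x[0]]) + sum(np.log2(P[a, b]) for a, b in zip(x, x[1:]))
rate = -np.sum(mu[:, None] * P * np.log2(P))         # entropy rate H(X)
print(-log_p / n, rate)                              # the two should be close
```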

  • Entropy rate for Markov chain

    • For a stationary Markov chain

    $H(X) = \lim H(X_n|X_{n-1}, \ldots, X_1) = \lim H(X_n|X_{n-1}) = H(X_2|X_1)$ (by the Markov property, then stationarity)

    • By definition, $p(X_2 = j|X_1 = i) = P_{ij}$

    • Entropy rate of Markov chain

    $H(X) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij}$


  • Calculating the entropy rate is fairly easy

    1. Find the stationary distribution µi

    2. Use the transition probabilities Pij in the formula below (a code sketch follows)

    $H(X) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij}$

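A minimal implementation of this two-step recipe (assuming an ergodic chain so the stationary distribution is unique; log base 2, so the answer is in bits):

```python
import numpy as np

def stationary(P):
    """Solve mu = mu P with sum(mu) = 1 (assumes a unique stationary distribution)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def entropy_rate(P):
    """H(X) = -sum_{i,j} mu_i P_ij log2 P_ij, in bits."""
    mu = stationary(P)
    safe = np.where(P > 0, P, 1.0)      # log2(1) = 0, so zero entries contribute 0
    return -np.sum(mu[:, None] * P * np.log2(safe))

print(entropy_rate(np.array([[0.8, 0.2],
                             [0.4, 0.6]])))   # ~0.805 bits per step
```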

  • Entropy rate of weather model

    Stationary distribution $\mu(S) = \frac{\alpha}{\alpha+\beta}$, $\mu(R) = \frac{\beta}{\alpha+\beta}$, and $P = \begin{pmatrix} 1-\beta & \beta \\ \alpha & 1-\alpha \end{pmatrix}$

    $H(X) = -\frac{\alpha}{\alpha+\beta}\left[\beta \log \beta + (1-\beta) \log(1-\beta)\right] - \frac{\beta}{\alpha+\beta}\left[\alpha \log \alpha + (1-\alpha) \log(1-\alpha)\right]$

    $= \frac{\alpha}{\alpha+\beta} H(\beta) + \frac{\beta}{\alpha+\beta} H(\alpha)$

    $\le H\!\left(\frac{2\alpha\beta}{\alpha+\beta}\right) \le H\!\left(\sqrt{\alpha\beta}\right)$

    Maximum when α = β = 1/2: degenerates to an independent process

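A numeric check that the closed form matches the general formula, with hypothetical α and β:

```python
import numpy as np

def Hb(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

alpha, beta = 0.4, 0.2                               # hypothetical parameters
P = np.array([[1 - beta, beta], [alpha, 1 - alpha]])
mu = np.array([alpha, beta]) / (alpha + beta)
direct = -np.sum(mu[:, None] * P * np.log2(P))       # -sum mu_i P_ij log2 P_ij
closed = alpha / (alpha + beta) * Hb(beta) + beta / (alpha + beta) * Hb(alpha)
print(direct, closed)                                # both ~0.805 bits
```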

  • Random walk on graph

    • An undirected graph with m nodes {1, . . . ,m}

    • Edge (i, j) has weight $W_{ij} \ge 0$ ($W_{ij} = W_{ji}$)

    • A particle walks randomly from node to node

    • Random walk X1, X2, · · · : a sequence of vertices

    • Given Xn = i, the next node is chosen from the neighbors of i with probability

    $P_{ij} = \frac{W_{ij}}{\sum_k W_{ik}}$



  • Entropy rate of random walk on graph

    • Let $W_i = \sum_j W_{ij}$, $W = \sum_{i,j:\, i>j} W_{ij}$ (each edge counted once)

    • The stationary distribution is $\mu_i = \frac{W_i}{2W}$

    • Can verify this is a stationary distribution: µP = µ (see the sketch below)

    • Stationary distribution ∝ weight of the edges emanating from node i (locality)

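A sketch for a small hypothetical weighted graph: build P from W, confirm µi = Wi/2W is stationary, and compute the entropy rate with the general formula.

```python
import numpy as np

W = np.array([[0, 1, 2],
              [1, 0, 3],
              [2, 3, 0]], dtype=float)  # hypothetical symmetric edge weights
Wi = W.sum(axis=1)                      # W_i = total weight at node i
P = W / Wi[:, None]                     # P_ij = W_ij / sum_k W_ik
mu = Wi / W.sum()                       # = W_i / (2W): W.sum() counts each edge twice
print(np.allclose(mu @ P, mu))          # True: mu is stationary
safe = np.where(P > 0, P, 1.0)          # zero entries contribute 0 to the sum
print(-np.sum(mu[:, None] * P * np.log2(safe)))  # entropy rate in bits
```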


  • Summary

    • AEP: for a stationary process X1, X2, · · · with Xi ∈ X, as n → ∞

    $p(x_1, \ldots, x_n) \approx 2^{-nH(X)}$

    • Entropy rate

    $H(X) = \lim_{n\to\infty} H(X_n|X_{n-1}, \ldots, X_1) = \lim_{n\to\infty} \frac{1}{n} H(X_1, \ldots, X_n)$

    • Random walk on graph: $\mu_i = \frac{W_i}{2W}$
