Information Theory Rong Jin


Page 1:

Information Theory

Rong Jin

Page 2:

Outline
• Information
• Entropy
• Mutual information
• Noisy channel model

Page 3:

Information
• Information ≠ knowledge
• Information: reduction in uncertainty
• Example:
  1. Flip a coin
  2. Roll a die
  #2 is more uncertain than #1; therefore, the outcome of #2 provides more information than the outcome of #1.

Page 4:

Definition of Information
Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E) = log2(1/P(E)) bits of information.

Example:
• Result of a fair coin flip: log2(2) = 1 bit
• Result of a fair die roll: log2(6) = 2.585 bits
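
As a quick check of the definition, a minimal Python sketch (names are mine, not from the slides) evaluating I(E) for the two examples:

```python
import math

def self_information(p_event):
    """I(E) = log2(1/P(E)) bits for an event E with probability p_event."""
    return math.log2(1.0 / p_event)

print(self_information(1/2))   # fair coin flip -> 1.0 bit
print(self_information(1/6))   # fair die roll  -> ~2.585 bits
```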

Page 5:

Information is Additive
I(k fair coin tosses) = log2(2^k) = k bits

Example: information conveyed by words
• Random word from a 100,000-word vocabulary: I(word) = log2(100,000) = 16.6 bits
• A 1000-word document from the same source: I(document) = 16,600 bits
• A 480x640 pixel, 16-greyscale video picture: I(picture) = 307,200 * log2(16) = 1,228,800 bits
• A picture is worth more than a 1000 words!
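
The arithmetic behind these figures, keeping the slide's independence assumptions, as a short sketch:

```python
import math

i_word = math.log2(100_000)            # random word from a 100,000-word vocabulary: ~16.6 bits
i_document = 1000 * i_word             # 1000 words, assumed independent: ~16,600 bits
i_picture = 480 * 640 * math.log2(16)  # 307,200 pixels * 4 bits each = 1,228,800 bits

print(round(i_word, 1), round(i_document), int(i_picture))
```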

Page 9:

Outline
• Information
• Entropy
• Mutual Information
• Cross Entropy and Learning

Page 10:

Entropy
A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, …, sk} with probabilities {p1, p2, …, pk}, respectively, where the symbols emitted are statistically independent.

What is the average amount of information in observing the output of the source S?

Call this the entropy:

H(S) = Σ_i p_i I(s_i) = Σ_i p_i log2(1/p_i) = E_{s~P}[ log2(1/p(s)) ]
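
As a numeric companion to this definition, a minimal sketch; the example distribution is mine, chosen only for illustration:

```python
import math

def entropy(probs):
    """H(S) = sum_i p_i * log2(1/p_i) for a zero-memory source."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Illustrative source probabilities (not from the slides)
print(entropy([0.5, 0.25, 0.25]))   # -> 1.5 bits per symbol
```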

Page 12:

Explanation of Entropy

H(P) = Σ_i p_i log2(1/p_i)

1. Average amount of information provided per symbol
2. Average # of bits needed to communicate each symbol

Page 13:

Properties of Entropy

H(P) = Σ_i p_i log2(1/p_i)

1. Non-negative: H(P) ≥ 0
2. For any other probability distribution {q1, …, qk}: H(P) = Σ_i p_i log2(1/p_i) ≤ Σ_i p_i log2(1/q_i)
3. H(P) ≤ log2(k), with equality iff pi = 1/k for all i
4. The further P is from uniform, the lower the entropy.
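
A quick numeric check of properties 1-3; the distributions p and q below are arbitrary choices for illustration:

```python
import math

def entropy(probs):
    """H(P) = sum_i p_i * log2(1/p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def cross_entropy(p, q):
    """sum_i p_i * log2(1/q_i); never smaller than H(P) (Gibbs' inequality)."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]          # an arbitrary distribution
q = [1/3, 1/3, 1/3]          # the uniform distribution over k = 3 symbols

print(entropy(p) >= 0)                     # property 1: True
print(entropy(p) <= cross_entropy(p, q))   # property 2: True
print(entropy(p) <= math.log2(len(p)))     # property 3: True (equality only if p is uniform)
```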

Page 14:

Entropy: k = 2

H(P) = p log2(1/p) + (1 - p) log2(1/(1 - p))

[Figure: H(P) plotted as a function of p, for p between 0 and 1]

Notice:

• zero information at the edges (p = 0 or p = 1)

• maximum information at p = 0.5 (1 bit)

• the curve drops off more quickly close to the edges than in the middle
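
The curve can be reproduced numerically; a short sketch of the binary entropy function that also shows the fast drop near the edges:

```python
import math

def binary_entropy(p):
    """H(P) = p*log2(1/p) + (1-p)*log2(1/(1-p)), with H = 0 at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))

for p in (0.0, 0.01, 0.1, 0.3, 0.5):
    print(p, round(binary_entropy(p), 3))
# approximate values: 0.0, 0.081, 0.469, 0.881, 1.0 bits
```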

Page 15:

The Entropy of English
• 27 characters (A-Z, space)
• 100,000 words (average 6.5 characters each)

Assuming independence between successive characters:
• Uniform character distribution: log2(27) = 4.75 bits/char
• True character distribution: 4.03 bits/char

Assuming independence between successive words:
• Uniform word distribution: log2(100,000)/6.5 = 2.55 bits/char
• True word distribution: 9.45/6.5 = 1.45 bits/char

The true entropy of English is much lower!
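
A short sketch reproducing the per-character figures above; the 4.03 bits/char and 9.45 bits/word values are the slide's quoted estimates, not computed here:

```python
import math

# Character-level estimates (27 symbols: A-Z and space)
uniform_char = math.log2(27)               # ~4.75 bits/char
true_char = 4.03                           # figure quoted on the slide

# Word-level estimates (100,000 words, ~6.5 characters per word on average)
uniform_word = math.log2(100_000) / 6.5    # ~2.55 bits/char
true_word = 9.45 / 6.5                     # ~1.45 bits/char (9.45 bits/word from the slide)

print(round(uniform_char, 2), round(uniform_word, 2), round(true_word, 2))
```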

Page 17:

Entropy of Two Sources

Temperature T

P(T = hot) = 0.3

P(T = mild) = 0.5

P(T = cold) = 0.2

H(T) = H(0.3, 0.5, 0.2) = 1.485

Humidity M

P(M = low) = 0.6

P(M = high) = 0.4

H(M) = H(0.6, 0.4) = 0.971

Random variables T and M are not independent:

• P(T=t, M=m) ≠ P(T=t) P(M=m)

Page 18:

Joint Entropy

• H(T) = 1.485

• H(M) = 0.971

• H(T) + H(M) = 2.456

• Joint entropy: H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321

• H(T, M) ≤ H(T) + H(M)

Joint Probability P(T, M)

Page 19:

Conditional Entropy

• H(T | M = low) = 1.252
• H(T | M = high) = 1.5

Average conditional entropy:
H(T | M) = Σ_m P(M = m) H(T | M = m) = 0.6 * 1.252 + 0.4 * 1.5 = 1.351

How much is M telling us on average about T?
H(T) - H(T | M) = 1.485 - 1.351 = 0.134 bits

Conditional Probability P(T | M)
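
The numbers on the last three slides can be reproduced in a few lines. The joint table below is my reconstruction, chosen to match the quoted marginals and entropies, since the slides show P(T, M) only as a figure:

```python
import math

def entropy(probs):
    """H = sum_i p_i * log2(1/p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution P(T, M), reconstructed to be consistent with the
# marginals and entropy values quoted on the slides.
joint = {('hot', 'low'): 0.1, ('mild', 'low'): 0.4, ('cold', 'low'): 0.1,
         ('hot', 'high'): 0.2, ('mild', 'high'): 0.1, ('cold', 'high'): 0.1}

# Marginals P(T) and P(M)
p_t, p_m = {}, {}
for (t, m), p in joint.items():
    p_t[t] = p_t.get(t, 0.0) + p
    p_m[m] = p_m.get(m, 0.0) + p

h_t = entropy(p_t.values())        # ~1.485
h_m = entropy(p_m.values())        # ~0.971
h_tm = entropy(joint.values())     # ~2.321  (<= h_t + h_m)
h_t_given_m = h_tm - h_m           # ~1.351  (chain rule: H(T,M) = H(M) + H(T|M))
mi = h_t - h_t_given_m             # ~0.134  (= I(T;M))

print(round(h_t, 3), round(h_m, 3), round(h_tm, 3),
      round(h_t_given_m, 3), round(mi, 3))
```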

Page 20:

Mutual Information

Properties:
• Indicates the amount of information one random variable provides about another
• Symmetric: I(X;Y) = I(Y;X)
• Non-negative
• Zero iff X, Y are independent

I(X;Y) = H(X) - H(X|Y)
       = Σ_x P(x) log2(1/P(x)) - Σ_{x,y} P(x,y) log2(1/P(x|y))
       = Σ_{x,y} P(x,y) log2( P(x,y) / (P(x) P(y)) )

Page 21:

Relationship

[Diagram: the joint entropy H(X, Y) decomposed into H(X|Y), I(X;Y), and H(Y|X); H(X) spans H(X|Y) + I(X;Y), and H(Y) spans H(Y|X) + I(X;Y)]

Page 22:

A Distance Measure Between Distributions

Kullback-Leibler distance:

KL(PD || PM) = Σ_x PD(x) log2( PD(x) / PM(x) ) = E_{x~PD}[ log2( PD(x) / PM(x) ) ]

Properties of the Kullback-Leibler distance:
• Non-negative: KL(PD || PM) ≥ 0, with equality iff PD = PM
• Minimizing the KL distance drives PM toward PD
• Non-symmetric: KL(PD || PM) ≠ KL(PM || PD)
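
A minimal sketch of the definition and the two listed properties; the two distributions are arbitrary illustrative choices:

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x)/q(x)); p and q as aligned lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_d = [0.5, 0.3, 0.2]   # "data" distribution (illustrative values)
p_m = [0.4, 0.4, 0.2]   # "model" distribution (illustrative values)

print(kl(p_d, p_m) >= 0)              # non-negative
print(kl(p_d, p_m), kl(p_m, p_d))     # the two directions generally differ
```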

Page 23:

Bregman Distance

Φ(x) is a convex function.
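
The slide's formula did not survive extraction. For reference, the standard Bregman distance generated by a convex function Φ is B_Φ(x, y) = Φ(x) - Φ(y) - <∇Φ(y), x - y>. The sketch below uses that standard definition (not necessarily the exact form on the slide) and checks that choosing Φ as negative entropy recovers the KL distance, in nats:

```python
import math

def neg_entropy(p):
    """Phi(p) = sum_i p_i * ln(p_i): a convex function on probability vectors."""
    return sum(pi * math.log(pi) for pi in p if pi > 0)

def grad_neg_entropy(q):
    return [math.log(qi) + 1.0 for qi in q]

def bregman(x, y, phi, grad_phi):
    """B_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y> (standard definition)."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

def kl_nats(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
# With Phi = negative entropy, the Bregman distance between probability
# vectors equals KL(p || q) in nats:
print(bregman(p, q, neg_entropy, grad_neg_entropy))
print(kl_nats(p, q))
```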

Page 24:

Compression Algorithm for TC

[Diagram: training examples for the Sports and Politics classes are each compressed, producing archives of 109K and 116K; a new document is to be classified]

Page 25:

Compression Algorithm for TC

[Diagram: the new document is compressed together with the Sports training examples and with the Politics training examples, giving archives of 126K and 129K, compared with 109K and 116K without the new document. Conclusion: Topic = Sports]
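
The slides appear to classify a new document by how little it inflates the compressed size of each class's training examples. A minimal illustration of that idea with a generic compressor (zlib); the toy strings below are placeholders, not the slides' corpora:

```python
import zlib

def compressed_size(text):
    """Size in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode('utf-8')))

def classify(new_doc, training_by_class):
    """Pick the class whose compressed archive grows least when new_doc is appended."""
    growths = {label: compressed_size(text + " " + new_doc) - compressed_size(text)
               for label, text in training_by_class.items()}
    return min(growths, key=growths.get)

# Toy placeholder corpora, not the slides' data
training = {
    'Sports': "game score team player season match win league coach goal",
    'Politics': "election vote senate policy president parliament bill law",
}
print(classify("the team won the match in the last game of the season", training))
```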

Page 26:


The Noisy Channel

Prototypical case:
Input → The channel (adds noise) → Output (noisy)
0,1,1,1,0,1,0,1,...  →  0,1,1,0,0,1,1,0,...

Model: probability of error (noise). Example:
p(0|1) = 0.3   p(1|1) = 0.7   p(1|0) = 0.4   p(0|0) = 0.6

The task: known: the noisy output; want to know: the input (decoding)
• Source coding theorem
• Channel coding theorem
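
Using the error probabilities on this slide, a small sketch of the decoding step: compute the posterior over the input bit from one observed output bit via Bayes' rule. The uniform input prior is my assumption, not given on the slide:

```python
# Channel model from the slide: p(output | input)
p_out_given_in = {
    (0, 1): 0.3, (1, 1): 0.7,   # when the input bit is 1
    (1, 0): 0.4, (0, 0): 0.6,   # when the input bit is 0
}

def decode(observed, p_input_is_1=0.5):
    """MAP estimate of the input bit from one observed output bit.
    The input prior (here uniform) is an assumption, not given on the slide."""
    prior = {0: 1.0 - p_input_is_1, 1: p_input_is_1}
    joint = {x: prior[x] * p_out_given_in[(observed, x)] for x in (0, 1)}
    total = sum(joint.values())
    posterior = {x: joint[x] / total for x in (0, 1)}
    return max(posterior, key=posterior.get), posterior

print(decode(1))   # -> (1, {0: ~0.36, 1: ~0.64})
print(decode(0))   # -> (0, {0: ~0.67, 1: ~0.33})
```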

Page 27:


Noisy Channel Applications

• OCR: straightforward: text → print (adds noise) → scan → image

• Handwriting recognition: text → neurons, muscles ("noise") → scan/digitize → image

• Speech recognition (dictation, commands, etc.): text → conversion to acoustic signal ("noise") → acoustic waves

• Machine Translation: text in target language → translation ("noise") → source language