Foundations of Probabilities and Information Theory for Machine Learning


Page 1: Foundations of Probabilities

Foundations of Probabilities and Information Theory for Machine Learning

Page 2: Foundations of Probabilities

Random Variables

Some proofs


Page 3: Foundations of Probabilities

E[X + Y] = E[X] + E[Y], where X and Y are random variables of the same type (i.e. either both discrete or both continuous).

The discrete case:

E[X + Y] = ∑_{ω∈Ω} (X(ω) + Y(ω)) · P(ω) = ∑_ω X(ω) · P(ω) + ∑_ω Y(ω) · P(ω) = E[X] + E[Y]

The continuous case:

E[X + Y] = ∫_x ∫_y (x + y) p_{XY}(x, y) dy dx
  = ∫_x ∫_y x p_{XY}(x, y) dy dx + ∫_x ∫_y y p_{XY}(x, y) dy dx
  = ∫_x x ( ∫_y p_{XY}(x, y) dy ) dx + ∫_y y ( ∫_x p_{XY}(x, y) dx ) dy
  = ∫_x x p_X(x) dx + ∫_y y p_Y(y) dy = E[X] + E[Y]
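As a quick numerical illustration (an editorial addition, not part of the original proof), the following minimal Python sketch checks E[X + Y] = E[X] + E[Y] on simulated data; the particular distributions of X and Y are arbitrary assumptions, and note that no independence is needed.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # two (deliberately dependent) random variables: linearity needs no independence
    x = rng.exponential(scale=2.0, size=n)
    y = 0.5 * x + rng.normal(loc=1.0, scale=3.0, size=n)

    print(np.mean(x + y))            # ~ E[X + Y]
    print(np.mean(x) + np.mean(y))   # ~ E[X] + E[Y]; equal up to sampling noise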

Page 4: Foundations of Probabilities

X and Y are independent ⇒ E[XY] = E[X] · E[Y], where X and Y are random variables of the same type (i.e. either both discrete or both continuous).

The discrete case:

E[XY] = ∑_{x∈Val(X)} ∑_{y∈Val(Y)} x y P(X = x, Y = y) = ∑_{x∈Val(X)} ∑_{y∈Val(Y)} x y P(X = x) · P(Y = y)
  = ∑_{x∈Val(X)} x P(X = x) ∑_{y∈Val(Y)} y P(Y = y)
  = ∑_{x∈Val(X)} x P(X = x) E[Y] = E[X] · E[Y]

The continuous case:

E[XY] = ∫_x ∫_y x y p(X = x, Y = y) dy dx = ∫_x ∫_y x y p(X = x) · p(Y = y) dy dx
  = ∫_x x p(X = x) ( ∫_y y p(Y = y) dy ) dx = ∫_x x p(X = x) E[Y] dx
  = E[Y] · ∫_x x p(X = x) dx = E[X] · E[Y]

Page 5: Foundations of Probabilities

Binomial distribution: b(r; n, p) def.= C^r_n p^r (1 − p)^{n−r}

Significance: b(r; n, p) is the probability of drawing r heads in n independent flips of a coin having the head probability p.

b(r; n, p) indeed represents a probability distribution:

• b(r; n, p) = C^r_n p^r (1 − p)^{n−r} ≥ 0 for all p ∈ [0, 1], n ∈ ℕ and r ∈ {0, 1, . . . , n};
• ∑_{r=0}^{n} b(r; n, p) = 1:
  (1 − p)^n + C^1_n p(1 − p)^{n−1} + · · · + C^{n−1}_n p^{n−1}(1 − p) + p^n = [p + (1 − p)]^n = 1

Page 6: Foundations of Probabilities

Binomial distribution: calculating the mean

E[b(r; n, p)] def.= ∑_{r=0}^{n} r · b(r; n, p)
  = 1 · C^1_n p(1 − p)^{n−1} + 2 · C^2_n p²(1 − p)^{n−2} + · · · + (n − 1) · C^{n−1}_n p^{n−1}(1 − p) + n · p^n
  = p [ C^1_n (1 − p)^{n−1} + 2 · C^2_n p(1 − p)^{n−2} + · · · + (n − 1) · C^{n−1}_n p^{n−2}(1 − p) + n · p^{n−1} ]
  (1)
  = np [ (1 − p)^{n−1} + C^1_{n−1} p(1 − p)^{n−2} + · · · + C^{n−2}_{n−1} p^{n−2}(1 − p) + C^{n−1}_{n−1} p^{n−1} ]
  = np [ p + (1 − p) ]^{n−1} = np

For the (1) equality we used the following property:

k C^k_n = k · n! / (k! (n − k)!) = n! / ((k − 1)! (n − k)!) = n (n − 1)! / ((k − 1)! (n − 1 − (k − 1))!) = n C^{k−1}_{n−1}, ∀k = 1, . . . , n.

Page 7: Foundations of Probabilities

Binomial distribution: calculating the variance
(following www.proofwiki.org/wiki/Variance of Binomial Distribution, which cites "Probability: An Introduction", by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986)

We will make use of the formula Var[X] = E[X²] − E²[X]. By denoting q = 1 − p, it follows:

E[b²(r; n, p)] def.= ∑_{r=0}^{n} r² C^r_n p^r q^{n−r} = ∑_{r=0}^{n} r² · (n(n − 1) · · · (n − r + 1) / r!) p^r q^{n−r}
  = ∑_{r=1}^{n} r · (n(n − 1) · · · (n − r + 1) / (r − 1)!) p^r q^{n−r} = ∑_{r=1}^{n} r n C^{r−1}_{n−1} p^r q^{n−r}
  = np ∑_{r=1}^{n} r C^{r−1}_{n−1} p^{r−1} q^{(n−1)−(r−1)}

Page 8: Foundations of Probabilities

Binomial distribution: calculating the variance (cont’d)

By denoting j = r − 1 and m = n − 1, we'll get:

E[b²(r; n, p)] = np ∑_{j=0}^{m} (j + 1) C^j_m p^j q^{m−j}
  = np [ ∑_{j=0}^{m} j C^j_m p^j q^{m−j} + ∑_{j=0}^{m} C^j_m p^j q^{m−j} ]
  = np [ ∑_{j=0}^{m} j · (m · . . . · (m − j + 1) / j!) p^j q^{m−j} + (p + q)^m ]      (p + q = 1)
  = np [ ∑_{j=1}^{m} m C^{j−1}_{m−1} p^j q^{m−j} + 1 ]
  = np [ mp ∑_{j=1}^{m} C^{j−1}_{m−1} p^{j−1} q^{(m−1)−(j−1)} + 1 ]
  = np [ (n − 1)p (p + q)^{m−1} + 1 ] = np [ (n − 1)p + 1 ] = n²p² − np² + np

Finally,

Var[X] = E[b²(r; n, p)] − (E[b(r; n, p)])² = n²p² − np² + np − n²p² = np(1 − p)

Page 9: Foundations of Probabilities

Binomial distribution: calculating the variance

Another solution

• it can be shown relatively easily that any random variable following the binomial distribution b(r; n, p) can be seen as a sum of n independent variables that follow the Bernoulli distribution of parameter p;(a)

• we know (or it can be proven immediately) that the variance of the Bernoulli distribution of parameter p is p(1 − p);

• taking into account the linearity property of variances (Var[X₁ + X₂ + . . . + Xₙ] = Var[X₁] + Var[X₂] + . . . + Var[Xₙ], if X₁, X₂, . . . , Xₙ are independent variables), it follows that Var[X] = np(1 − p).

(a) See www.proofwiki.org/wiki/Bernoulli Process as Binomial Distribution, which also cites as its source "Probability: An Introduction" by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986.
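A minimal Python check of the two results E[X] = np and Var[X] = np(1 − p) (an editorial addition; the values of n and p below are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 0.3                               # assumed parameters for the illustration

    samples = rng.binomial(n, p, size=1_000_000)
    print(samples.mean(), n * p)                 # empirical vs. theoretical mean
    print(samples.var(), n * p * (1 - p))        # empirical vs. theoretical variance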

Page 10: Foundations of Probabilities

The Gaussian distribution: p(X = x) = (1 / (√(2π) σ)) e^{−(x − µ)² / (2σ²)}

Calculating the mean: E[N_{µ,σ}(x)] def.= ∫_{−∞}^{∞} x p(x) dx = (1 / (√(2π) σ)) ∫_{−∞}^{∞} x · e^{−(x − µ)² / (2σ²)} dx

Using the variable transformation v = (x − µ)/σ, which implies x = σv + µ and dx = σ dv, we get:

E[X] = (1 / (√(2π) σ)) ∫_{−∞}^{∞} (σv + µ) e^{−v²/2} (σ dv) = (σ / (√(2π) σ)) [ σ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
  = (1 / √(2π)) [ −σ ∫_{−∞}^{∞} (−v) e^{−v²/2} dv + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
  = (1 / √(2π)) [ −σ e^{−v²/2} |_{−∞}^{∞} (= 0) + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
  = (µ / √(2π)) ∫_{−∞}^{∞} e^{−v²/2} dv      (see the next slide for the computation of this last integral)
  = (µ / √(2π)) · √(2π) = µ

Page 11: Foundations of Probabilities

The Gaussian distribution: calculating the mean (Cont’d)

( ∫_{v=−∞}^{∞} e^{−v²/2} dv )² = ( ∫_{x=−∞}^{∞} e^{−x²/2} dx ) · ( ∫_{y=−∞}^{∞} e^{−y²/2} dy ) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{−(x² + y²)/2} dy dx = ∫∫_{ℝ²} e^{−(x² + y²)/2} dy dx

By switching from x, y to polar coordinates r, θ (see the Note below), it follows:

( ∫_{v=−∞}^{∞} e^{−v²/2} dv )² = ∫_{r=0}^{∞} ∫_{θ=0}^{2π} e^{−r²/2} (r dr dθ) = ∫_{r=0}^{∞} r e^{−r²/2} ( ∫_{θ=0}^{2π} dθ ) dr = ∫_{r=0}^{∞} r e^{−r²/2} θ |_0^{2π} dr
  = 2π ∫_{r=0}^{∞} r e^{−r²/2} dr = 2π (−e^{−r²/2}) |_0^{∞} = 2π (0 − (−1)) = 2π  ⇒  ∫_{v=−∞}^{∞} e^{−v²/2} dv = √(2π).

Note: x = r cos θ and y = r sin θ, with r ≥ 0 and θ ∈ [0, 2π). Therefore, x² + y² = r², and the Jacobian is

∂(x, y)/∂(r, θ) = | ∂x/∂r  ∂x/∂θ |  =  | cos θ   −r sin θ |  = r cos²θ + r sin²θ = r ≥ 0.    So, dx dy = r dr dθ.
                  | ∂y/∂r  ∂y/∂θ |     | sin θ    r cos θ |

Page 12: Foundations of Probabilities

Bivariate Gaussian


Page 13: Foundations of Probabilities

The Gaussian distribution: calculating the variance

We will make use of the formula Var[X] = E[X²] − E²[X].

E[X²] = ∫_{−∞}^{∞} x² p(x) dx = (1 / (√(2π) σ)) ∫_{−∞}^{∞} x² · e^{−(x − µ)² / (2σ²)} dx

Again, using the transformation v = (x − µ)/σ, which implies x = σv + µ and dx = σ dv. Therefore,

E[X²] = (1 / (√(2π) σ)) ∫_{−∞}^{∞} (σv + µ)² e^{−v²/2} (σ dv)
  = (σ / (√(2π) σ)) ∫_{−∞}^{∞} (σ²v² + 2σµv + µ²) e^{−v²/2} dv
  = (1 / √(2π)) [ σ² ∫_{−∞}^{∞} v² e^{−v²/2} dv + 2σµ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ² ∫_{−∞}^{∞} e^{−v²/2} dv ]

Note that we have already computed ∫_{−∞}^{∞} v e^{−v²/2} dv = 0 and ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

Page 14: Foundations of Probabilities

The Gaussian distribution: calculating the variance (Cont’d)

Therefore, we only need to compute

∫_{−∞}^{∞} v² e^{−v²/2} dv = ∫_{−∞}^{∞} (−v) (−v e^{−v²/2}) dv = ∫_{−∞}^{∞} (−v) (e^{−v²/2})′ dv
  = (−v) e^{−v²/2} |_{−∞}^{∞} − ∫_{−∞}^{∞} (−1) e^{−v²/2} dv = 0 + ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

Here above we used the fact that

lim_{v→∞} v e^{−v²/2} = lim_{v→∞} v / e^{v²/2}  =(l'Hôpital)=  lim_{v→∞} 1 / (v e^{v²/2}) = 0 = lim_{v→−∞} v e^{−v²/2}

So, E[X²] = (1 / √(2π)) ( σ² √(2π) + 2σµ · 0 + µ² √(2π) ) = σ² + µ².

And, finally, Var[X] = E[X²] − (E[X])² = (σ² + µ²) − µ² = σ².
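The same kind of numerical sanity check can be done for E[X] = µ and Var[X] = σ² (an editorial sketch; the values of µ and σ below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 1.5, 2.0                         # arbitrary parameters
    x = rng.normal(mu, sigma, size=1_000_000)

    print(x.mean(), mu)                          # ~ mu
    print(x.var(), sigma ** 2)                   # ~ sigma^2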

Page 15: Foundations of Probabilities

Vectors of random variables.

A property:

The covariance matrix Σ corresponding to such a vector is symmetric and positive semi-definite

Chuong Do, Stanford University, 2008

[adapted by Liviu Ciortuz]


Page 16: Foundations of Probabilities

Let X₁, . . . , Xₙ be random variables, with Xᵢ : Ω → ℝ for i = 1, . . . , n. The covariance matrix of the vector of random variables X = (X₁, . . . , Xₙ) is a square matrix of size n × n, whose elements are defined as [Cov(X)]ᵢⱼ def.= Cov(Xᵢ, Xⱼ), for all i, j ∈ {1, . . . , n}.

Show that Σ not.= Cov(X) is a symmetric and positive semi-definite matrix, the second property meaning that for any vector z ∈ ℝⁿ the inequality z⊤Σz ≥ 0 holds. (The vectors z ∈ ℝⁿ are considered column vectors, and the symbol ⊤ denotes matrix transposition.)

Page 17: Foundations of Probabilities

Cov(X)_{i,j} def.= Cov(Xᵢ, Xⱼ), for all i, j ∈ {1, . . . , n}, and

Cov(Xᵢ, Xⱼ) def.= E[(Xᵢ − E[Xᵢ])(Xⱼ − E[Xⱼ])] = E[(Xⱼ − E[Xⱼ])(Xᵢ − E[Xᵢ])] = Cov(Xⱼ, Xᵢ),

therefore Cov(X) is a symmetric matrix.

We will show that z^T Σ z ≥ 0 for any z ∈ ℝⁿ (seen as a column vector):

z^T Σ z = ∑_{i=1}^{n} zᵢ ( ∑_{j=1}^{n} Σᵢⱼ zⱼ ) = ∑_{i=1}^{n} ∑_{j=1}^{n} zᵢ Σᵢⱼ zⱼ = ∑_{i=1}^{n} ∑_{j=1}^{n} zᵢ Cov[Xᵢ, Xⱼ] zⱼ
  = ∑_{i=1}^{n} ∑_{j=1}^{n} zᵢ E[(Xᵢ − E[Xᵢ])(Xⱼ − E[Xⱼ])] zⱼ = E[ ∑_{i=1}^{n} ∑_{j=1}^{n} zᵢ (Xᵢ − E[Xᵢ])(Xⱼ − E[Xⱼ]) zⱼ ]
  = E[ ( ∑_{i=1}^{n} zᵢ (Xᵢ − E[Xᵢ]) ) ( ∑_{j=1}^{n} (Xⱼ − E[Xⱼ]) zⱼ ) ]
  = E[ ( ∑_{i=1}^{n} (Xᵢ − E[Xᵢ]) zᵢ ) ( ∑_{j=1}^{n} (Xⱼ − E[Xⱼ]) zⱼ ) ]
  = E[ ((X − E[X])^T · z)² ] ≥ 0
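A short Python sketch (an editorial addition, using an arbitrary random dataset) that checks the two properties numerically: the sample covariance matrix is symmetric, its eigenvalues are non-negative, and z^T Σ z ≥ 0 for a random z.

    import numpy as np

    rng = np.random.default_rng(0)
    # 5000 samples of a 3-dimensional random vector with correlated components
    data = rng.normal(size=(5000, 3))
    data[:, 1] += 0.8 * data[:, 0]
    data[:, 2] -= 0.5 * data[:, 0]

    sigma = np.cov(data, rowvar=False)                 # sample covariance matrix
    print(np.allclose(sigma, sigma.T))                 # symmetry
    print(np.linalg.eigvalsh(sigma).min() >= -1e-12)   # all eigenvalues >= 0 (PSD)

    z = rng.normal(size=3)
    print(z @ sigma @ z >= 0)                          # z^T Sigma z >= 0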

Page 18: Foundations of Probabilities

Multi-variate Gaussian distributions:

A property:

When the covariance matrix of a multi-variate (d-dimensional) Gaussian distribution is diagonal, then the p.d.f. (probability density function) of the respective multi-variate Gaussian is equal to the product of d independent uni-variate Gaussian densities.

Page 19: Foundations of Probabilities

Let's consider X = [X₁ . . . X_d]^T, µ ∈ ℝ^d and Σ ∈ S^d_+, where S^d_+ is the set of symmetric positive definite matrices (which implies |Σ| ≠ 0 and (x − µ)^T Σ^{−1} (x − µ) > 0, therefore −(1/2)(x − µ)^T Σ^{−1} (x − µ) < 0, for any x ∈ ℝ^d, x ≠ µ).

The probability density function of a multi-variate Gaussian distribution of parameters µ and Σ is:

p(x; µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) ).

Notation: X ∼ N(µ, Σ).

Show that when the covariance matrix Σ is diagonal, the p.d.f. (probability density function) of the respective multi-variate Gaussian is equal to the product of d independent uni-variate Gaussian densities.

We will make the proof for d = 2 (generalization to d > 2 is easy):

x = [x₁; x₂],   µ = [µ₁; µ₂],   Σ = [ σ₁²  0 ;  0  σ₂² ]

Note: It is easy to show that if Σ ∈ S^d_+ is diagonal, the elements on the principal diagonal of Σ are indeed strictly positive. (It is enough to take z = (1, 0) and respectively z = (0, 1) in the formula for positive-definiteness of Σ.) This is why we wrote these diagonal elements of Σ as σ₁² and σ₂².


Page 21: Foundations of Probabilities

p(x; µ, Σ) = (1 / (2π | σ₁²  0 ;  0  σ₂² |^{1/2})) exp( −(1/2) [x₁ − µ₁; x₂ − µ₂]^T [ σ₁²  0 ;  0  σ₂² ]^{−1} [x₁ − µ₁; x₂ − µ₂] )

  = (1 / (2π σ₁σ₂)) exp( −(1/2) [x₁ − µ₁; x₂ − µ₂]^T [ 1/σ₁²  0 ;  0  1/σ₂² ] [x₁ − µ₁; x₂ − µ₂] )

  = (1 / (2π σ₁σ₂)) exp( −(1/2) [x₁ − µ₁; x₂ − µ₂]^T [ (x₁ − µ₁)/σ₁² ; (x₂ − µ₂)/σ₂² ] )

  = (1 / (2π σ₁σ₂)) exp( −(x₁ − µ₁)²/(2σ₁²) − (x₂ − µ₂)²/(2σ₂²) )

  = p(x₁; µ₁, σ₁²) · p(x₂; µ₂, σ₂²).
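A minimal numerical check of this factorization, using scipy (the particular µᵢ, σᵢ and the evaluation point are assumptions made for the illustration):

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    mu = np.array([1.0, -2.0])
    sigma1, sigma2 = 0.7, 1.5
    cov = np.diag([sigma1 ** 2, sigma2 ** 2])    # diagonal covariance matrix

    x = np.array([0.3, -1.1])                    # an arbitrary evaluation point
    joint = multivariate_normal(mu, cov).pdf(x)
    product = norm(mu[0], sigma1).pdf(x[0]) * norm(mu[1], sigma2).pdf(x[1])
    print(joint, product)                        # the two values coincide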

Page 22: Foundations of Probabilities

Bi-variate Gaussian distributions. A property:

The conditional distributions X₁|X₂ and X₂|X₁ are also Gaussian.

The calculation of their parameters

Duda, Hart and Stork, Pattern Classification, 2001, Appendix A.5.2

[adapted by Liviu Ciortuz]

Page 23: Foundations of Probabilities

Let X be a random variable following a bi-variate Gaussian distribution of parameters µ (the vector of means) and Σ (the covariance matrix). Thus, µ = (µ₁, µ₂) ∈ ℝ², and Σ ∈ M_{2×2}(ℝ).

By definition, Σ = Cov(X, X), where X not.= (X₁, X₂), so Σᵢⱼ = Cov(Xᵢ, Xⱼ) for i, j ∈ {1, 2}. Also, Cov(Xᵢ, Xᵢ) = Var[Xᵢ] not.= σᵢ² ≥ 0 for i ∈ {1, 2}, while for i ≠ j we have Cov(Xᵢ, Xⱼ) = Cov(Xⱼ, Xᵢ) not.= σ₁₂.

Finally, if we introduce the "correlation coefficient" ρ def.= σ₁₂ / (σ₁σ₂), it follows that we can write the covariance matrix as

Σ = [ σ₁²     ρσ₁σ₂
      ρσ₁σ₂   σ₂²  ].      (2)

Page 24: Foundations of Probabilities

Prove that the hypothesis X ∼ N(µ, Σ) implies that the conditional distribution X₂|X₁ is Gaussian, namely

X₂ | X₁ = x₁ ∼ N(µ_{2|1}, σ²_{2|1}),

with µ_{2|1} = µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁) and σ²_{2|1} = σ₂² (1 − ρ²).

Remark: For X₁|X₂ the result is similar: X₁ | X₂ = x₂ ∼ N(µ_{1|2}, σ²_{1|2}), with µ_{1|2} = µ₁ + ρ (σ₁/σ₂)(x₂ − µ₂) and σ²_{1|2} = σ₁² (1 − ρ²).

Source: Pattern Classification, Appendix A.5.2, Duda, Hart and Stork, 2001

Page 25: Foundations of Probabilities

Answer

p_{X₂|X₁}(x₂|x₁) def.= p_{X₁,X₂}(x₁, x₂) / p_{X₁}(x₁),      (3)

where

p_{X₁,X₂}(x₁, x₂) = (1 / ((√(2π))² √|Σ|)) exp( −(1/2) (x − µ)^⊤ Σ^{−1} (x − µ) )

and

p_{X₁}(x₁) = (1 / (√(2π) σ₁)) exp( −(x₁ − µ₁)² / (2σ₁²) ).      (4)

From (2) it follows that |Σ| = σ₁²σ₂²(1 − ρ²). In order that |Σ| ≠ 0 and Σ^{−1} be defined, it follows that ρ ∈ (−1, 1). Moreover, since σ₁, σ₂ > 0, we will have √|Σ| = σ₁σ₂ √(1 − ρ²).

Σ^{−1} = (1 / (σ₁²σ₂²(1 − ρ²))) Σ* = (1 / (σ₁²σ₂²(1 − ρ²))) [ σ₂²      −ρσ₁σ₂
                                                              −ρσ₁σ₂    σ₁²   ]
       = (1 / (1 − ρ²)) [ 1/σ₁²        −ρ/(σ₁σ₂)
                          −ρ/(σ₁σ₂)     1/σ₂²    ]

Page 26: Foundations of Probabilities

So,

p_{X₁,X₂}(x₁, x₂) = (1 / (2πσ₁σ₂√(1 − ρ²))) exp( −(1 / (2(1 − ρ²))) (x₁ − µ₁, x₂ − µ₂) [ 1/σ₁²  −ρ/(σ₁σ₂) ; −ρ/(σ₁σ₂)  1/σ₂² ] (x₁ − µ₁; x₂ − µ₂) )

  = (1 / (2πσ₁σ₂√(1 − ρ²))) · exp( −(1 / (2(1 − ρ²))) [ ((x₁ − µ₁)/σ₁)² − 2ρ ((x₁ − µ₁)/σ₁)((x₂ − µ₂)/σ₂) + ((x₂ − µ₂)/σ₂)² ] )      (5)

Page 27: Foundations of Probabilities

By substituting (4) and (5) in the definition (3), we get:

p(x₂|x₁) = p_{X₁,X₂}(x₁, x₂) / p_{X₁}(x₁)

  = (1 / (2πσ₁σ₂√(1 − ρ²))) · exp( −(1 / (2(1 − ρ²))) [ ((x₁ − µ₁)/σ₁)² − 2ρ ((x₁ − µ₁)/σ₁)((x₂ − µ₂)/σ₂) + ((x₂ − µ₂)/σ₂)² ] ) · √(2π) σ₁ exp( (1/2) ((x₁ − µ₁)/σ₁)² )

  = (1 / (√(2π) σ₂ √(1 − ρ²))) exp( −(1 / (2(1 − ρ²))) ( (x₂ − µ₂)/σ₂ − ρ (x₁ − µ₁)/σ₁ )² )

  = (1 / (√(2π) σ₂ √(1 − ρ²))) exp( −(1/2) ( (x₂ − [µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁)]) / (σ₂ √(1 − ρ²)) )² )

Therefore,

X₂ | X₁ = x₁ ∼ N(µ_{2|1}, σ²_{2|1}) with µ_{2|1} not.= µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁) and σ²_{2|1} not.= σ₂² (1 − ρ²).
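A sketch (editorial addition) that checks the formulas for µ_{2|1} and σ²_{2|1} by conditioning samples of a bivariate Gaussian on X₁ ≈ x₁; all parameter values below are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    mu1, mu2, s1, s2, rho = 1.0, -1.0, 2.0, 1.5, 0.6
    cov = np.array([[s1 ** 2, rho * s1 * s2],
                    [rho * s1 * s2, s2 ** 2]])

    samples = rng.multivariate_normal([mu1, mu2], cov, size=2_000_000)
    x1 = 2.0                                              # condition on X1 near 2.0
    sel = samples[np.abs(samples[:, 0] - x1) < 0.02, 1]   # X2 values with X1 ~ x1

    print(sel.mean(), mu2 + rho * s2 / s1 * (x1 - mu1))   # ~ mu_{2|1}
    print(sel.var(), s2 ** 2 * (1 - rho ** 2))            # ~ sigma^2_{2|1}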

Page 28: Foundations of Probabilities

Using the Central Limit Theorem (the i.i.d. version) to compute the real error of a classifier

CMU, 2008 fall, Eric Xing, HW3, pr. 3.3

Page 29: Foundations of Probabilities

Chris recently adopted a new (binary) classifier to filter email spam. He wants to quantitatively evaluate how good the classifier is.

He has a small dataset of 100 emails on hand which, you can assume, are randomly drawn from all emails.

He tests the classifier on the 100 emails and gets 83 classified correctly, so the error rate on the small dataset is 17%.

However, the rate measured on 100 samples could be either higher or lower than the real error rate just by chance.

With a confidence level of 95%, what is likely to be the range of the real error rate? Please write down all important steps.

(Hint: You need some approximation in this problem.)

Page 30: Foundations of Probabilities

Notations:

Let Xᵢ, i = 1, . . . , n = 100 be defined as: Xᵢ = 1 if the email i was incorrectly classified, and 0 otherwise;

E[Xᵢ] not.= µ not.= e_real ;   Var(Xᵢ) not.= σ²

e_sample not.= (X₁ + . . . + Xₙ) / n = 0.17

Zₙ = (X₁ + . . . + Xₙ − nµ) / (√n σ)      (the standardized form of X₁ + . . . + Xₙ)

Key insight:

Calculating the real error of the classifier (more exactly, a symmetric interval around the real error p not.= µ) with a "confidence" of 95% amounts to finding a > 0 such that P(|Zₙ| ≤ a) ≥ 0.95.

Page 31: Foundations of Probabilities

Calculus:

|Zₙ| ≤ a ⇔ |(X₁ + . . . + Xₙ − nµ) / (√n σ)| ≤ a ⇔ |(X₁ + . . . + Xₙ − nµ) / (nσ)| ≤ a/√n
  ⇔ |(X₁ + . . . + Xₙ − nµ) / n| ≤ aσ/√n ⇔ |(X₁ + . . . + Xₙ)/n − µ| ≤ aσ/√n
  ⇔ |e_sample − e_real| ≤ aσ/√n ⇔ |e_real − e_sample| ≤ aσ/√n
  ⇔ −aσ/√n ≤ e_real − e_sample ≤ aσ/√n
  ⇔ e_sample − aσ/√n ≤ e_real ≤ e_sample + aσ/√n
  ⇔ e_real ∈ [ e_sample − aσ/√n , e_sample + aσ/√n ]

Page 32: Foundations of Probabilities

Important facts:

The Central Limit Theorem: Zₙ → N(0, 1)

Therefore, P(|Zₙ| ≤ a) ≈ P(|X| ≤ a) = Φ(a) − Φ(−a), where X ∼ N(0, 1) and Φ is the cumulative distribution function of N(0, 1).

Calculus:

Φ(−a) + Φ(a) = 1 ⇒ P(|Zₙ| ≤ a) = Φ(a) − Φ(−a) = 2Φ(a) − 1

P(|Zₙ| ≤ a) = 0.95 ⇔ 2Φ(a) − 1 = 0.95 ⇔ Φ(a) = 0.975 ⇔ a ≅ 1.97 (see the Φ table)

σ² not.= Var_real = e_real(1 − e_real), because the Xᵢ are Bernoulli variables.

Furthermore, we can approximate e_real by e_sample, because E[e_sample] = e_real and Var_sample = (1/n) Var_real → 0 for n → +∞, cf. CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1.ab.

Finally:

aσ/√n ≈ 1.97 · √(0.17(1 − 0.17)) / √100 ≅ 0.07

|e_real − e_sample| ≤ 0.07 ⇔ |e_real − 0.17| ≤ 0.07 ⇔ −0.07 ≤ e_real − 0.17 ≤ 0.07
  ⇔ e_real ∈ [0.10, 0.24]
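The final interval can be reproduced with a few lines of Python (an editorial addition; scipy's ppf gives the 0.975 quantile of N(0, 1) instead of reading a Φ table):

    import numpy as np
    from scipy.stats import norm

    n, e_sample = 100, 0.17
    a = norm.ppf(0.975)                          # ~1.96 (the slide reads 1.97 off a table)
    half_width = a * np.sqrt(e_sample * (1 - e_sample) / n)
    print(half_width)                                        # ~0.074
    print((e_sample - half_width, e_sample + half_width))    # ~(0.10, 0.24)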

Page 33: Foundations of Probabilities

Exemplifying a mixture of categorical distributions; how to compute its expectation and variance

CMU, 2010 fall, Aarti Singh, HW1, pr. 2.2.1-2

Page 34: Foundations of Probabilities

Suppose that I have two six-sided dice, one is fair and the other one is loaded – having:

P(x) = 1/2 if x = 6;   1/10 if x ∈ {1, 2, 3, 4, 5}

I will toss a coin to decide which die to roll. If the coin flip is heads I will roll the fair die, otherwise the loaded one. The probability that the coin flip is heads is p ∈ (0, 1).

a. What is the expectation of the die roll (in terms of p)?

b. What is the variance of the die roll (in terms of p)?

Page 35: Foundations of Probabilities

Solution:

a.

E[X] = ∑_{i=1}^{6} i · [ P(i|fair) · p + P(i|loaded) · (1 − p) ]
     = [ ∑_{i=1}^{6} i · P(i|fair) ] p + [ ∑_{i=1}^{6} i · P(i|loaded) ] (1 − p)
     = (7/2) p + (9/2)(1 − p) = 9/2 − p

Page 36: Foundations of Probabilities

b. Recall that we may write Var(X) = E[X²] − (E[X])², therefore:

E[X²] = ∑_{i=1}^{6} i² · [ P(i|fair) · p + P(i|loaded) · (1 − p) ]
      = [ ∑_{i=1}^{6} i² · P(i|fair) ] p + [ ∑_{i=1}^{6} i² · P(i|loaded) ] (1 − p)
      = (91/6) p + (36/2 + 55/10)(1 − p)
      = 47/2 − (25/3) p

Combining this with the result of the previous question yields:

Var(X) = E[X²] − (E[X])² = 141/6 − (50/6) p − (9/2 − p)²
       = 141/6 − (50/6) p − (81/4 − 9p + p²)
       = (141/6 − 81/4) − (50/6 − 9) p − p²
       = 13/4 + (2/3) p − p²
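A small Python check of both answers for an assumed value of p (an editorial addition; exact fractions avoid rounding issues):

    from fractions import Fraction as F

    p = F(1, 3)                                  # assumed coin bias; any p in (0, 1) works
    fair = {i: F(1, 6) for i in range(1, 7)}
    loaded = {i: (F(1, 2) if i == 6 else F(1, 10)) for i in range(1, 7)}

    mix = {i: p * fair[i] + (1 - p) * loaded[i] for i in range(1, 7)}
    e_x = sum(i * mix[i] for i in mix)
    e_x2 = sum(i * i * mix[i] for i in mix)

    print(e_x, F(9, 2) - p)                                  # E[X] = 9/2 - p
    print(e_x2 - e_x ** 2, F(13, 4) + F(2, 3) * p - p ** 2)  # Var(X) = 13/4 + 2p/3 - p^2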

Page 37: Foundations of Probabilities

Elements of Information Theory: some examples and then some useful proofs

Page 38: Foundations of Probabilities

Computing entropies and specific conditional entropies for discrete random variables

CMU, 2012 spring, R. Rosenfeld, HW2, pr. 2

Page 39: Foundations of Probabilities

On the roll of two six-sided fair dice,

a. Calculate the distribution of the sum (S) of the two dice.

b. The amount of information (or surprise) when seeing the outcome x for a random variable X is defined as log₂ (1 / P(X = x)) = − log₂ P(X = x). How surprised are you (in bits) to observe S = 2, S = 11, S = 5, S = 7?

c. Calculate the entropy of S [as the expected value of the random variable − log₂ P(X = x)].

d. Let's say you throw the dice one by one, and the first die shows 4. What is the entropy of S after this observation? Was any information gained / lost in the process? If so, calculate how much information (in bits) was lost or gained.

Page 40: Foundations of Probabilities

a.

S      2     3     4     5     6     7     8     9     10    11    12
P(S)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

b.

Information(S = 2)  = − log₂(1/36) = log₂ 36 = 2 log₂ 6 = 2(1 + log₂ 3) = 5.169925001 bits
Information(S = 11) = − log₂(2/36) = log₂ 18 = 1 + 2 log₂ 3 = 4.169925001 bits
Information(S = 5)  = − log₂(4/36) = log₂ 9  = 2 log₂ 3 = 3.169925001 bits
Information(S = 7)  = − log₂(6/36) = log₂ 6  = 1 + log₂ 3 = 2.584962501 bits

Page 41: Foundations of Probabilities

c.

H(S) = −∑_{i=1}^{n} pᵢ log pᵢ
  = −( 2 · (1/36) log (1/36) + 2 · (2/36) log (2/36) + 2 · (3/36) log (3/36) + 2 · (4/36) log (4/36) + 2 · (5/36) log (5/36) + (6/36) log (6/36) )
  = (1/36) ( 2 log₂ 36 + 4 log₂ 18 + 6 log₂ 12 + 8 log₂ 9 + 10 log₂ (36/5) + 6 log₂ 6 )
  = (1/36) ( 2 log₂ 6² + 4 log₂ (6 · 3) + 6 log₂ (6 · 2) + 8 log₂ 3² + 10 log₂ (6²/5) + 6 log₂ 6 )
  = (1/36) ( 40 log₂ 6 + 20 log₂ 3 + 6 − 10 log₂ 5 )
  = (1/36) ( 60 log₂ 3 + 46 − 10 log₂ 5 ) = 3.274401919 bits.

Page 42: Foundations of Probabilities

d.

S         2  3  4  5    6    7    8    9    10   11  12
P(S|...)  0  0  0  1/6  1/6  1/6  1/6  1/6  1/6  0   0

H(S | First-die-shows-4) = −6 · (1/6) log₂ (1/6) = log₂ 6 = 2.58 bits,

IG(S; First-die-shows-4) = H(S) − H(S | First-die-shows-4) = 3.27 − 2.58 = 0.69 bits.
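The numbers above can be verified with a short Python computation (an editorial addition):

    from collections import Counter
    from math import log2

    # distribution of the sum of two fair dice
    counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
    p_s = {s: c / 36 for s, c in counts.items()}

    h_s = -sum(p * log2(p) for p in p_s.values())
    print(h_s)                                   # ~3.274 bits

    # after observing that the first die shows 4, S is uniform over {5, ..., 10}
    h_cond = log2(6)
    print(h_cond, h_s - h_cond)                  # ~2.585 bits; information gain ~0.69 bits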

Page 43: Foundations of Probabilities

Computing entropies and mean conditional entropies for discrete random variables

CMU, 2012 spring, R. Rosenfeld, HW2, pr. 3

Page 44: Foundations of Probabilities

A doctor needs to diagnose whether a person has a cold (C). The primary factor he considers in his diagnosis is the outside temperature (T). The random variable C takes two values, yes / no, and the random variable T takes 3 values, sunny, rainy, snowy. The joint distribution of the two variables is given in the following table.

           T = sunny   T = rainy   T = snowy
C = no       0.30        0.20        0.10
C = yes      0.05        0.15        0.20

a. Calculate the marginal probabilities P(C), P(T).

Hint: Use P(X = x) = ∑_y P(X = x, Y = y). For example,
P(C = no) = P(C = no, T = sunny) + P(C = no, T = rainy) + P(C = no, T = snowy).

b. Calculate the entropies H(C), H(T).

c. Calculate the mean conditional entropies H(C|T), H(T|C).

Page 45: Foundations of Probabilities

a. P_C = (0.6, 0.4) and P_T = (0.35, 0.35, 0.30).

b.

H(C) = 0.6 log (5/3) + 0.4 log (5/2) = log 5 − 0.6 log 3 − 0.4 = 0.971 bits

H(T) = 2 · 0.35 log (20/7) + 0.3 log (10/3) = 0.7(2 + log 5 − log 7) + 0.3(1 + log 5 − log 3)
     = 1.7 + log 5 − 0.7 log 7 − 0.3 log 3 = 1.581 bits.

Page 46: Foundations of Probabilities

c.

H(C|T) def.= ∑_{t∈Val(T)} P(T = t) · H(C|T = t)
  = P(T = sunny) · H(C|T = sunny) + P(T = rainy) · H(C|T = rainy) + P(T = snowy) · H(C|T = snowy)
  = 0.35 · H( 0.30/(0.30 + 0.05), 0.05/(0.30 + 0.05) ) + 0.35 · H( 0.20/(0.20 + 0.15), 0.15/(0.20 + 0.15) ) + 0.30 · H( 0.10/(0.10 + 0.20), 0.20/(0.20 + 0.10) )
  = (7/20) · H(6/7, 1/7) + (7/20) · H(4/7, 3/7) + (3/10) · H(1/3, 2/3)
  = (7/20) · ( (6/7) log (7/6) + (1/7) log 7 ) + (7/20) · ( (4/7) log (7/4) + (3/7) log (7/3) ) + (3/10) · ( (1/3) log 3 + (2/3) log (3/2) )
  = (7/20) · ( log 7 − 6/7 − (6/7) log 3 ) + (7/20) · ( log 7 − 8/7 − (3/7) log 3 ) + (3/10) · ( log 3 − 2/3 )
  = (7/10) log 7 − (3/10 + 4/10 + 2/10) − (6/20 + 3/20 − 3/10) · log 3
  = (7/10) log 7 − (3/20) log 3 − 9/10 ≈ 0.8274 bits.

Page 47: Foundations of Probabilities

H(T|C) def.= ∑_{c∈Val(C)} P(C = c) · H(T|C = c)
  = P(C = no) · H(T|C = no) + P(C = yes) · H(T|C = yes)
  = 0.60 · H( 0.30/(0.30 + 0.20 + 0.10), 0.20/(0.30 + 0.20 + 0.10), 0.10/(0.30 + 0.20 + 0.10) ) + 0.40 · H( 0.05/(0.05 + 0.15 + 0.20), 0.15/(0.05 + 0.15 + 0.20), 0.20/(0.05 + 0.15 + 0.20) )
  = (3/5) · H(1/2, 1/3, 1/6) + (2/5) · H(1/8, 3/8, 1/2)
  = (3/5) ( 1/2 + (1/3) log 3 + (1/6)(1 + log 3) ) + (2/5) ( (1/8) · 3 + (3/8)(3 − log 3) + 1/2 )
  = (3/5) ( 2/3 + (1/2) log 3 ) + (2/5) ( 2 − (3/8) log 3 )
  = 6/5 + (3/20) log 3 = 1.43774 bits.
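All the quantities in parts a–c can be recomputed with a few lines of Python (an editorial addition):

    import numpy as np

    # joint distribution: rows C = {no, yes}, columns T = {sunny, rainy, snowy}
    p = np.array([[0.30, 0.20, 0.10],
                  [0.05, 0.15, 0.20]])

    def H(dist):
        dist = dist[dist > 0]
        return -np.sum(dist * np.log2(dist))

    p_c, p_t = p.sum(axis=1), p.sum(axis=0)
    print(p_c, p_t)                                    # marginals
    print(H(p_c), H(p_t))                              # ~0.971, ~1.581 bits

    h_c_given_t = sum(p_t[j] * H(p[:, j] / p_t[j]) for j in range(3))
    h_t_given_c = sum(p_c[i] * H(p[i, :] / p_c[i]) for i in range(2))
    print(h_c_given_t, h_t_given_c)                    # ~0.827, ~1.438 bits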

Page 48: Foundations of Probabilities

Computing the entropy of the exponential distribution

CMU, 2011 spring, R. Rosenfeld, HW2, pr. 2.c

[Figure: plots of the exponential p.d.f. p(x) on [0, 5] for θ = 0.5, θ = 1 and θ = 1.5]

Page 49: Foundations of Probabilities

For a continuous probability distribution P, the entropy is defined as follows:

H(P) = ∫_{−∞}^{+∞} P(x) log₂ (1/P(x)) dx

Compute the entropy of the continuous exponential distribution of parameter λ > 0. The definition of this distribution is the following:

P(x) = λ e^{−λx}, if x ≥ 0;   0, if x < 0.

Hint: If P(x) = 0, you should assume that −P(x) log₂ P(x) = 0.

Page 50: Foundations of Probabilities

Answer

H(P) = ∫_{−∞}^{0} P(x) log₂ (1/P(x)) dx + ∫_{0}^{∞} P(x) log₂ (1/P(x)) dx
  (def. of P)  = ∫_{−∞}^{0} 0 · log₂ 0 dx (= 0) + ∫_{0}^{∞} λe^{−λx} log₂ (1/(λe^{−λx})) dx = ∫_{0}^{∞} λe^{−λx} log₂ (1/(λe^{−λx})) dx

⇒ H(P) = (1/ln 2) ∫_{0}^{∞} λe^{−λx} ln (1/(λe^{−λx})) dx = (1/ln 2) ∫_{0}^{∞} λe^{−λx} ( ln (1/λ) + ln (1/e^{−λx}) ) dx
  = (1/ln 2) ∫_{0}^{∞} λe^{−λx} ( −ln λ + ln e^{λx} ) dx
  = (1/ln 2) ∫_{0}^{∞} λe^{−λx} ( −ln λ + λx ) dx
  = (1/ln 2) ∫_{0}^{∞} λe^{−λx} (−ln λ) dx + (1/ln 2) ∫_{0}^{∞} λe^{−λx} λx dx
  = (−ln λ / ln 2) ∫_{0}^{∞} λe^{−λx} dx + (λ / ln 2) ∫_{0}^{∞} λe^{−λx} x dx
  = (ln λ / ln 2) ∫_{0}^{∞} (e^{−λx})′ dx − (λ / ln 2) ∫_{0}^{∞} (e^{−λx})′ x dx

Page 51: Foundations of Probabilities

The first integral is solved very easily:

∫_{0}^{∞} (e^{−λx})′ dx = e^{−λx} |_{0}^{∞} = e^{−∞} − e^{0} = 0 − 1 = −1

To solve the second integral one can use the integration-by-parts formula:

∫_{0}^{∞} (e^{−λx})′ x dx = e^{−λx} x |_{0}^{∞} − ∫_{0}^{∞} e^{−λx} x′ dx = e^{−λx} x |_{0}^{∞} − ∫_{0}^{∞} e^{−λx} dx

The term e^{−λx} x |_{0}^{∞} cannot be computed directly (because of the 0 · ∞ conflict that occurs when x is assigned the limit value ∞); it is computed using l'Hôpital's rule:

lim_{x→∞} x e^{−λx} = lim_{x→∞} x / e^{λx} = lim_{x→∞} x′ / (e^{λx})′ = lim_{x→∞} 1 / (λe^{λx}) = (1/λ) lim_{x→∞} e^{−λx} = 0,

hence e^{−λx} x |_{0}^{∞} = 0 − 0 = 0.

Page 52: Foundations of Probabilities

The integral ∫_{0}^{∞} e^{−λx} dx is computed easily:

∫_{0}^{∞} e^{−λx} dx = −(1/λ) ∫_{0}^{∞} (e^{−λx})′ dx = −(1/λ) e^{−λx} |_{0}^{∞} = −(1/λ)(0 − 1) = 1/λ

Therefore,

∫_{0}^{∞} (e^{−λx})′ x dx = 0 − 1/λ = −1/λ,

which leads to the final result:

H(P) = (ln λ / ln 2)(−1) − (λ / ln 2)(−1/λ) = −ln λ / ln 2 + 1 / ln 2 = (1 − ln λ) / ln 2.
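A numerical cross-check of H(P) = (1 − ln λ)/ln 2 by direct integration (an editorial addition; the value of λ is arbitrary):

    import numpy as np
    from scipy.integrate import quad

    lam = 1.5                                    # arbitrary rate parameter

    def integrand(x):
        p = lam * np.exp(-lam * x)
        return -p * np.log2(p)

    print(quad(integrand, 0, np.inf)[0])         # numerically integrated entropy
    print((1 - np.log(lam)) / np.log(2))         # closed form; ~ the same value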

Page 53: Foundations of Probabilities

Derivation of the entropy definition, starting from a set of desirable properties

CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2.2

Page 54: Foundations of Probabilities

Remark: The definition we gave for entropy, −∑_{i=1}^{n} pᵢ log pᵢ, is not very intuitive.

Theorem:

If ψₙ(p₁, . . . , pₙ) satisfies the following axioms

A0. [LC:] ψₙ(p₁, . . . , pₙ) ≥ 0 for any n ∈ ℕ* and p₁, . . . , pₙ, since we view ψₙ as a measure of disorder; also, ψ₁(1) = 0 because in this case there is no disorder;

A1. ψₙ should be continuous in the pᵢ and symmetric in its arguments;

A2. if pᵢ = 1/n then ψₙ should be a monotonically increasing function of n; (If all events are equally likely, then having more events means being more uncertain.)

A3. if a choice among N events is broken down into successive choices, then the entropy should be the weighted sum of the entropies at each stage;

then ψₙ(p₁, . . . , pₙ) = −K ∑ᵢ pᵢ log pᵢ, where K is a positive constant.
(As we'll see, K depends however on ψₛ(1/s, . . . , 1/s) for a certain s ∈ ℕ*.)

Remark: We will prove the theorem first for uniform distributions (pᵢ = 1/n) and then for the case pᵢ ∈ ℚ (only!).

Page 55: Foundations of Probabilities

Example for the axiom A3:

Encoding 1: (a, b, c) is split directly into a, b, c with probabilities 1/2, 1/3, 1/6.
Encoding 2: (a, b, c) is first split into a and (b, c) with probabilities 1/2, 1/2; then (b, c) is split into b and c with probabilities 2/3, 1/3.

H(1/2, 1/3, 1/6) = (1/2) log 2 + (1/3) log 3 + (1/6) log 6 = (1/2 + 1/6) log 2 + (1/3 + 1/6) log 3 = 2/3 + (1/2) log 3

H(1/2, 1/2) + (1/2) H(2/3, 1/3) = 1 + (1/2) ( (2/3) log (3/2) + (1/3) log 3 ) = 1 + (1/2) ( log 3 − 2/3 ) = 2/3 + (1/2) log 3

The next 3 slides:

Case 1: pᵢ = 1/n for i = 1, . . . , n; proof steps:

Page 56: Foundations of Probabilities

a. A(n) not.= ψ(1/n, 1/n, . . . , 1/n) implies

  A(s^m) = m A(s) for any s, m ∈ ℕ*.      (1)

b. If s, m ∈ ℕ* (fixed), s ≠ 1, and t, n ∈ ℕ* are such that s^m ≤ t^n ≤ s^{m+1}, then

  | m/n − log t / log s | ≤ 1/n.      (2)

c. For s^m ≤ t^n ≤ s^{m+1} as above, due to A2 it follows (immediately) that

  ψ_{s^m}(1/s^m, . . . , 1/s^m) ≤ ψ_{t^n}(1/t^n, . . . , 1/t^n) ≤ ψ_{s^{m+1}}(1/s^{m+1}, . . . , 1/s^{m+1}),

  i.e. A(s^m) ≤ A(t^n) ≤ A(s^{m+1}). Show that

  | m/n − A(t)/A(s) | ≤ 1/n for s ≠ 1.      (3)

d. Combining (2) + (3) immediately gives

  | A(t)/A(s) − log t / log s | ≤ 2/n for s ≠ 1.      (4)

  Show that this inequality implies

  A(t) = K log t with K > 0 (due to A2).      (5)

Page 57: Foundations of Probabilities

Proof

a. [Figure: the s^m equiprobable outcomes, each of probability 1/s^m, are encoded either in a single step, or in m successive levels, each level splitting every node into s equally likely branches of probability 1/s.]

Applying the axiom A3 to the second (level-by-level) encoding gives:

A(s^m) = A(s) + s · (1/s) A(s) + s² · (1/s²) A(s) + . . . + s^{m−1} · (1/s^{m−1}) A(s)
       = A(s) + A(s) + A(s) + . . . + A(s)   (m times)   = m A(s)

Page 58: Foundations of Probabilities

Proof (cont'd)

b.

s^m ≤ t^n ≤ s^{m+1} ⇒ m log s ≤ n log t ≤ (m + 1) log s ⇒ m/n ≤ log t / log s ≤ m/n + 1/n
  ⇒ 0 ≤ log t / log s − m/n ≤ 1/n ⇒ | log t / log s − m/n | ≤ 1/n

c.

A(s^m) ≤ A(t^n) ≤ A(s^{m+1})  ⇒(1)  m A(s) ≤ n A(t) ≤ (m + 1) A(s)
  ⇒(s ≠ 1)  m/n ≤ A(t)/A(s) ≤ m/n + 1/n ⇒ 0 ≤ A(t)/A(s) − m/n ≤ 1/n ⇒ | A(t)/A(s) − m/n | ≤ 1/n

d. Consider again s^m ≤ t^n ≤ s^{m+1} with s, t fixed. If m → ∞ then n → ∞, and from

| A(t)/A(s) − log t / log s | ≤ 2/n it follows that | A(t)/A(s) − log t / log s | → 0.

Therefore | A(t)/A(s) − log t / log s | = 0, and so A(t)/A(s) = log t / log s.

Finally, A(t) = (A(s) / log s) log t = K log t, where K = A(s) / log s > 0 (if s ≠ 1).

Page 59: Foundations of Probabilities

Case 2: pᵢ ∈ ℚ for i = 1, . . . , n

Let's consider a set of N ≥ 2 equiprobable random events, and P = (S₁, S₂, . . . , S_k) a partition of this set. Let's denote pᵢ = |Sᵢ| / N.

A "natural" two-step encoding (level 1: choose the block Sᵢ with probability |Sᵢ|/N; level 2: choose one of its |Sᵢ| equiprobable elements, each with probability 1/|Sᵢ|) leads to

A(N) = ψ_k(p₁, . . . , p_k) + ∑ᵢ pᵢ A(|Sᵢ|),

based on the axiom A3.

Finally, using the result A(t) = K log t gives:

K log N = ψ_k(p₁, . . . , p_k) + K ∑ᵢ pᵢ log |Sᵢ|

⇒ ψ_k(p₁, . . . , p_k) = K [ log N − ∑ᵢ pᵢ log |Sᵢ| ]
  = K [ log N ∑ᵢ pᵢ − ∑ᵢ pᵢ log |Sᵢ| ] = −K ∑ᵢ pᵢ log (|Sᵢ| / N)
  = −K ∑ᵢ pᵢ log pᵢ

Page 60: Foundations of Probabilities

Entropy, joint entropy, conditional entropy, information gain: definitions and immediate properties

CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2

Page 61: Foundations of Probabilities

Definitions

• The entropy of the variable X:
  H(X) def.= −∑ᵢ P(X = xᵢ) log P(X = xᵢ) not.= E_X[− log P(X)].

• The specific conditional entropy of the variable Y with respect to the value x_k of the variable X:
  H(Y | X = x_k) def.= −∑ⱼ P(Y = yⱼ | X = x_k) log P(Y = yⱼ | X = x_k) not.= E_{Y|X=x_k}[− log P(Y | X = x_k)].

• The mean conditional entropy of the variable Y with respect to the variable X:
  H(Y | X) def.= ∑_k P(X = x_k) H(Y | X = x_k) not.= E_X[H(Y | X)].

• The joint entropy of the variables X and Y:
  H(X, Y) def.= −∑ᵢ ∑ⱼ P(X = xᵢ, Y = yⱼ) log P(X = xᵢ, Y = yⱼ) not.= E_{X,Y}[− log P(X, Y)].

• The mutual information of the variables X and Y, also called the information gain of the variable X with respect to the variable Y (or vice versa):
  MI(X, Y) not.= IG(X, Y) def.= H(X) − H(X | Y) = H(Y) − H(Y | X)
  (Remark: the last equality above holds due to the result at point c below.)

Page 62: Foundations of Probabilities

a. H(X) ≥ 0.

H(X) = −∑ᵢ P(X = xᵢ) log P(X = xᵢ) = ∑ᵢ P(X = xᵢ) log (1/P(X = xᵢ)) ≥ 0,

since in each term both factors, P(X = xᵢ) and log (1/P(X = xᵢ)), are ≥ 0.

Moreover, H(X) = 0 if and only if the variable X is constant:

"⇒" Suppose that H(X) = 0, that is, ∑ᵢ P(X = xᵢ) log (1/P(X = xᵢ)) = 0. Since every term of this sum is greater than or equal to 0, it follows that H(X) = 0 only if for all i, P(X = xᵢ) = 0 or log (1/P(X = xᵢ)) = 0, i.e., for all i, P(X = xᵢ) = 0 or P(X = xᵢ) = 1. Since ∑ᵢ P(X = xᵢ) = 1, it follows that there is a single value x₁ of X such that P(X = x₁) = 1, while P(X = x) = 0 for any x ≠ x₁. In other words, the discrete random variable X is constant.

"⇐" Suppose that the variable X is constant, which means that X takes a single value x₁, with probability P(X = x₁) = 1. Therefore, H(X) = −1 · log 1 = 0.

Page 63: Foundations of Probabilities

b.

H(Y | X) = −∑ᵢ ∑ⱼ P(X = xᵢ, Y = yⱼ) log P(Y = yⱼ | X = xᵢ)

H(Y | X) = ∑ᵢ P(X = xᵢ) H(Y | X = xᵢ)
  = ∑ᵢ P(X = xᵢ) [ −∑ⱼ P(Y = yⱼ | X = xᵢ) log P(Y = yⱼ | X = xᵢ) ]
  = −∑ᵢ ∑ⱼ P(X = xᵢ) P(Y = yⱼ | X = xᵢ) log P(Y = yⱼ | X = xᵢ)      (P(X = xᵢ) P(Y = yⱼ | X = xᵢ) = P(X = xᵢ, Y = yⱼ))
  = −∑ᵢ ∑ⱼ P(X = xᵢ, Y = yⱼ) log P(Y = yⱼ | X = xᵢ)

Page 64: Foundations of Probabilities

c. H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)

H(X, Y) = −∑ᵢ ∑ⱼ p(xᵢ, yⱼ) log p(xᵢ, yⱼ)
  = −∑ᵢ ∑ⱼ p(xᵢ) · p(yⱼ | xᵢ) log [ p(xᵢ) · p(yⱼ | xᵢ) ]
  = −∑ᵢ ∑ⱼ p(xᵢ) · p(yⱼ | xᵢ) [ log p(xᵢ) + log p(yⱼ | xᵢ) ]
  = −∑ᵢ ∑ⱼ p(xᵢ) · p(yⱼ | xᵢ) log p(xᵢ) − ∑ᵢ ∑ⱼ p(xᵢ) · p(yⱼ | xᵢ) log p(yⱼ | xᵢ)
  = −∑ᵢ p(xᵢ) log p(xᵢ) · ∑ⱼ p(yⱼ | xᵢ) (= 1) − ∑ᵢ p(xᵢ) ∑ⱼ p(yⱼ | xᵢ) log p(yⱼ | xᵢ)
  = H(X) + ∑ᵢ p(xᵢ) H(Y | X = xᵢ) = H(X) + H(Y | X)

Page 65: Foundations of Probabilities

More generally (the chain rule):

H(X₁, . . . , Xₙ) = H(X₁) + H(X₂ | X₁) + . . . + H(Xₙ | X₁, . . . , X_{n−1})

H(X₁, . . . , Xₙ) = E[ log (1 / p(x₁, . . . , xₙ)) ] = −E_{p(x₁,...,xₙ)}[ log p(x₁, . . . , xₙ) ]
  = −E_{p(x₁,...,xₙ)}[ log p(x₁) + log p(x₂ | x₁) + . . . + log p(xₙ | x₁, . . . , x_{n−1}) ]
  = −E_{p(x₁)}[ log p(x₁) ] − E_{p(x₁,x₂)}[ log p(x₂ | x₁) ] − . . . − E_{p(x₁,...,xₙ)}[ log p(xₙ | x₁, . . . , x_{n−1}) ]
  = H(X₁) + H(X₂ | X₁) + . . . + H(Xₙ | X₁, . . . , X_{n−1})
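A small Python check of H(X, Y) = H(X) + H(Y|X) on an arbitrary joint distribution (an editorial addition):

    import numpy as np

    p = np.array([[0.10, 0.25, 0.05],            # an arbitrary joint distribution p(x, y)
                  [0.30, 0.10, 0.20]])

    def H(d):
        d = d[d > 0]
        return -np.sum(d * np.log2(d))

    p_x = p.sum(axis=1)
    h_y_given_x = sum(p_x[i] * H(p[i] / p_x[i]) for i in range(p.shape[0]))
    print(H(p.ravel()))                          # H(X, Y)
    print(H(p_x) + h_y_given_x)                  # H(X) + H(Y|X); equal to H(X, Y)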

Page 66: Foundations of Probabilities

An upper bound for the entropy of a discrete distribution

CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 1.1


Page 67: Foundations of Probabilities

Let X be a discrete random variable that takes n values and follows the probability distribution P. By definition, the entropy of X is

H(X) = −∑_{i=1}^{n} P(X = xᵢ) log₂ P(X = xᵢ).

Show that H(X) ≤ log₂ n.

Hint: You may use the inequality ln x ≤ x − 1, which holds for any x > 0.

Page 68: Foundations of Probabilities

Answer

H(X) = (1/ln 2) ( −∑_{i=1}^{n} P(X = xᵢ) ln P(X = xᵢ) )

Therefore,

H(X) ≤ log₂ n ⇔ (1/ln 2) ( −∑_{i=1}^{n} P(X = xᵢ) ln P(X = xᵢ) ) ≤ log₂ n
  ⇔ −∑_{i=1}^{n} P(xᵢ) ln P(xᵢ) ≤ ln n
  ⇔ ∑_{i=1}^{n} P(xᵢ) ln (1/P(xᵢ)) − ( ∑_{i=1}^{n} P(xᵢ) ) (= 1) · ln n ≤ 0
  ⇔ ∑_{i=1}^{n} P(xᵢ) ln (1/P(xᵢ)) − ∑_{i=1}^{n} P(xᵢ) ln n ≤ 0
  ⇔ ∑_{i=1}^{n} P(xᵢ) ( ln (1/P(xᵢ)) − ln n ) ≤ 0
  ⇔ ∑_{i=1}^{n} P(xᵢ) ln (1/(n P(xᵢ))) ≤ 0

Page 69: Foundations of Probabilities

Applying the inequality ln x ≤ x − 1 for x = 1/(n P(xᵢ)), we will have:

∑_{i=1}^{n} P(xᵢ) ln (1/(n P(xᵢ))) ≤ ∑_{i=1}^{n} P(xᵢ) ( 1/(n P(xᵢ)) − 1 ) = ∑_{i=1}^{n} 1/n − ∑_{i=1}^{n} P(xᵢ) (= 1) = 1 − 1 = 0

Remark: This upper bound is actually attained. For example, if a discrete random variable X with n values follows the uniform distribution, one can immediately verify that H(X) = log₂ n.
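A quick empirical illustration of the bound (an editorial addition): random distributions on n values never exceed log₂ n bits, and the uniform one attains it.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    for _ in range(5):
        p = rng.dirichlet(np.ones(n))            # a random distribution on n values
        print(H(p) <= np.log2(n))                # always True

    print(H(np.full(n, 1 / n)), np.log2(n))      # the uniform distribution attains log2(n)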

Page 70: Foundations of Probabilities

Relative entropy, a.k.a. the Kullback-Leibler divergence, and the [relationship to] information gain; some basic properties

CMU, 2007 fall, C. Guestrin, HW1, pr. 1.2
[adapted by Liviu Ciortuz]

Page 71: Foundations of Probabilities

The relative entropy — also known as the Kullback-Leibler (KL) divergence — from a distribution p to a distribution q is defined as

KL(p||q) def.= −∑_{x∈X} p(x) log (q(x)/p(x))

From an information theory perspective, the KL divergence specifies the number of additional bits required on average to transmit values of X if the values are distributed with respect to p but we encode them assuming the distribution q.

Page 72: Foundations of Probabilities

Notes

1. KL is not a distance measure, since it is not symmetric (i.e., in general KL(p||q) ≠ KL(q||p)).
Another measure, defined as JSD(p||q) = (1/2)(KL(p||q) + KL(q||p)) and called the Jensen-Shannon divergence, is symmetric.

2. The quantity

d(X, Y) def.= H(X, Y) − IG(X, Y) = H(X) + H(Y) − 2 IG(X, Y) = H(X | Y) + H(Y | X),

known as the variation of information, is a distance metric, i.e., it is non-negative, symmetric, implies indiscernibility, and satisfies the triangle inequality.

Page 73: Foundations of Probabilities

a. Show that KL(p||q) ≥ 0, and KL(p||q) = 0 iff p(x) = q(x) for all x.
(More generally, the smaller the KL divergence, the more similar the two distributions.)

Hint:

To prove this point you can use Jensen's inequality:

If ϕ : ℝ → ℝ is a convex function, then for any t ∈ [0, 1] and any x₁, x₂ ∈ ℝ it follows that ϕ(t x₁ + (1 − t) x₂) ≤ t ϕ(x₁) + (1 − t) ϕ(x₂).
If ϕ is a strictly convex function, then equality holds only if x₁ = x₂.

More generally, for any aᵢ ≥ 0, i = 1, . . . , n with ∑ᵢ aᵢ ≠ 0 and any xᵢ ∈ ℝ, i = 1, . . . , n, we have

ϕ( ∑ᵢ aᵢ xᵢ / ∑ⱼ aⱼ ) ≤ ∑ᵢ aᵢ ϕ(xᵢ) / ∑ⱼ aⱼ.

If ϕ is strictly convex, then equality holds only if x₁ = . . . = xₙ.

Obviously, similar results can be formulated for concave functions.

Page 74: Foundations of Probabilities

Answer

We prove the inequality KL(p||q) ≥ 0 using Jensen's inequality, in which we replace ϕ with the convex function − log₂, aᵢ with p(xᵢ), and xᵢ with q(xᵢ)/p(xᵢ).
(For convenience, in what follows we drop the index of the variable x.)
We have:

KL(p || q) def.= −∑_x p(x) log (q(x)/p(x))  ≥(Jensen)  − log ( ∑_x p(x) · q(x)/p(x) ) = − log ( ∑_x q(x) ) (= 1) = − log 1 = 0

Therefore, KL(p || q) ≥ 0, whatever the (discrete) distributions p and q.

Page 75: Foundations of Probabilities

We now prove that KL(p||q) = 0 ⇔ p = q.

"⇐" The equality p(x) = q(x) implies q(x)/p(x) = 1, hence log (q(x)/p(x)) = 0 for every x, from which KL(p||q) = 0 follows immediately.

"⇒" We know that in Jensen's inequality equality holds only when xᵢ = xⱼ for all i and j.
In the present case, this condition translates into the fact that the ratio q(x)/p(x) is the same for every value of x.
Taking into account that ∑_x p(x) = 1 and ∑_x p(x) · q(x)/p(x) = ∑_x q(x) = 1, it follows that q(x)/p(x) = 1 or, in other words, p(x) = q(x) for every x, which means that the distributions p and q are identical.
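A tiny Python illustration of these properties (an editorial addition): KL(p||q) ≥ 0, it vanishes exactly when p = q, and it is not symmetric. The two distributions below are arbitrary.

    import numpy as np

    def kl(p, q):
        return np.sum(p * np.log2(p / q))        # assumes p, q > 0 everywhere

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])

    print(kl(p, q), kl(q, p))                    # both >= 0, and generally different
    print(kl(p, p))                              # 0.0, since the distributions coincide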

Page 76: Foundations of Probabilities

b. We can define the information gain as the KL divergence from the observed joint distribution of X and Y to the product of their observed marginals:

IG(X, Y) def.= KL(p_{X,Y} || (p_X p_Y)) = −∑_x ∑_y p_{X,Y}(x, y) log ( p_X(x) p_Y(y) / p_{X,Y}(x, y) ) not.= −∑_x ∑_y p(x, y) log ( p(x) p(y) / p(x, y) )

Prove that this definition of information gain is equivalent to the one given in problem CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2. That is, show that IG(X, Y) = H[X] − H[X|Y] = H[Y] − H[Y|X], starting from the definition in terms of KL divergence.

Remark:

It follows that

IG(X, Y) = ∑_y p(y) ∑_x p(x | y) log ( p(x | y) / p(x) ) = ∑_y p(y) KL(p_{X|Y} || p_X) = E_Y[ KL(p_{X|Y} || p_X) ]

Page 77: Foundations of Probabilities

Answer

By making use of the multiplication rule, namely p(x, y) = p(x | y) p(y), we will have:

KL(p_{XY} || (p_X p_Y)) def.= −∑_x ∑_y p(x, y) log ( p(x) p(y) / p(x, y) )
  = −∑_x ∑_y p(x, y) log ( p(x) p(y) / (p(x | y) p(y)) )
  = −∑_x ∑_y p(x, y) [ log p(x) − log p(x | y) ]
  = −∑_x ∑_y p(x, y) log p(x) − ( −∑_x ∑_y p(x, y) log p(x | y) )
  = −∑_x log p(x) ∑_y p(x, y) (= p(x)) − H[X | Y]
  = H[X] − H[X | Y] = IG(X, Y)
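A short numerical confirmation (an editorial addition) that the two definitions of IG(X, Y) agree on an arbitrary joint distribution:

    import numpy as np

    p = np.array([[0.20, 0.10],                  # an arbitrary joint distribution p(x, y)
                  [0.05, 0.25],
                  [0.15, 0.25]])

    def H(d):
        d = d[d > 0]
        return -np.sum(d * np.log2(d))

    p_x, p_y = p.sum(axis=1), p.sum(axis=0)

    ig_kl = np.sum(p * np.log2(p / np.outer(p_x, p_y)))      # KL(p_XY || p_X p_Y)
    h_x_given_y = sum(p_y[j] * H(p[:, j] / p_y[j]) for j in range(p.shape[1]))
    ig_entropy = H(p_x) - h_x_given_y                        # H(X) - H(X|Y)
    print(ig_kl, ig_entropy)                                 # equal up to rounding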

Page 78: Foundations of Probabilities

c. A direct consequence of parts a and b is that IG(X, Y) ≥ 0 (and therefore H(X) ≥ H(X|Y) and H(Y) ≥ H(Y|X)) for any discrete random variables X and Y.

Prove that IG(X, Y) = 0 iff X and Y are independent.

Answer:

This is also an immediate consequence of parts a and b already proven:

IG(X, Y) = 0  ⇔(b)  KL(p_{XY} || p_X p_Y) = 0  ⇔(a)  X and Y are independent.

Page 79: Foundations of Probabilities

Remark

We can also prove the inequality IG(X, Y) ≥ 0 in a direct manner, using the result from part b and applying Jensen's inequality in its generalized form, with the following "amendments":

− instead of a single index, two indices are considered (so instead of aᵢ and xᵢ we will have aᵢⱼ and xᵢⱼ, respectively);
− we take ϕ = − log₂, and aᵢⱼ ← p(xᵢ, yⱼ), xᵢⱼ ← p(xᵢ) p(yⱼ) / p(xᵢ, yⱼ);
− finally, we take into account that ∑ᵢ ∑ⱼ p(xᵢ, yⱼ) = 1.

Therefore,

IG(X, Y) = ∑ᵢ ∑ⱼ p(xᵢ, yⱼ) log ( p(xᵢ, yⱼ) / (p(xᵢ) · p(yⱼ)) ) = ∑ᵢ ∑ⱼ p(xᵢ, yⱼ) [ −log ( p(xᵢ) · p(yⱼ) / p(xᵢ, yⱼ) ) ]
  ≥ −log ( ∑ᵢ ∑ⱼ p(xᵢ, yⱼ) · p(xᵢ) · p(yⱼ) / p(xᵢ, yⱼ) ) = −log ( ∑ᵢ ∑ⱼ p(xᵢ) · p(yⱼ) ) = −log ( ∑ᵢ p(xᵢ) (= 1) · ∑ⱼ p(yⱼ) (= 1) ) = −log 1 = 0

In conclusion, IG(X, Y) ≥ 0.

78.

Page 80: Foundations of Probabilities

Remark (cont’d)

Daca X si Y sunt variabilele independente,atunci p(xi, yj) = p(xi)p(yj) pentru orice i si j.

In consecinta, toti logaritmii din partea dreapta a primei egalitati din calculul de maisus sunt 0 si rezulta IG(X,Y ) = 0.

Invers, presupunand ca IG(X,Y ) = 0, vom tine cont de faptul ca putem exprima castigulde informatie cu ajutorul divergentei KL si vom aplica un rationament similar cu celde la punctul a.

Rezulta cap(xi)p(yj)

p(xi, yj)= 1 si deci p(xi)p(yj) = p(xi, yj) pentru orice i si j.

Aceasta echivaleaza cu a spune ca variabilele X si Y sunt independente.

79.

Page 81: Foundations of Probabilities

Proving [in a direct manner] that the Information Gain is always positive or 0

(an indirect proof was made at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2)

Liviu Ciortuz, 2017

Page 82: Foundations of Probabilities

The definition of the information gain (or: mutual information) of a random variable X with respect to another random variable Y is

IG(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).

At CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was shown, for the case in which X and Y are discrete, that IG(X, Y) = KL(P_{X,Y} || P_X P_Y), where KL denotes the relative entropy (or: the Kullback-Leibler divergence), P_X and P_Y are the distributions of the variables X and Y respectively, and P_{X,Y} is the joint distribution of these variables. Also at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was shown that the KL divergence is always non-negative. Consequently, IG(X, Y) ≥ 0 for any X and Y.

In this exercise we ask you to prove the inequality IG(X, Y) ≥ 0 in a direct manner, starting from the first definition given above, without appealing [again] to the Kullback-Leibler divergence.

Page 83: Foundations of Probabilities

Hint: You can use the following form of Jensen's inequality:

∑_{i=1}^{n} aᵢ log xᵢ ≤ log ( ∑_{i=1}^{n} aᵢ xᵢ )

where the base of the logarithm is greater than 1, aᵢ ≥ 0 for i = 1, . . . , n and ∑_{i=1}^{n} aᵢ = 1.

Remark: The advantage in this problem, compared to CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2.a, is that here we work with a single distribution (p), not with two distributions (p and q). However, the proof here is more laborious.

Answer

Suppose that the values of the variable X are x₁, x₂, . . . , xₙ, and the values of the variable Y are y₁, y₂, . . . , y_m. We have:

IG(X, Y) def.= H(X) − H(X|Y) def.= ∑_{i=1}^{n} −P(xᵢ) log₂ P(xᵢ) − ∑_{j=1}^{m} P(yⱼ) ∑_{i=1}^{n} ( −P(xᵢ|yⱼ) log₂ P(xᵢ|yⱼ) )

Page 84: Foundations of Probabilities

−IG(X, Y) = ∑_{i=1}^{n} P(xᵢ) log₂ P(xᵢ) − ∑_{j=1}^{m} P(yⱼ) ∑_{i=1}^{n} P(xᵢ|yⱼ) log₂ P(xᵢ|yⱼ)

  (def. of marginal prob.)  = ∑_{i=1}^{n} ( ∑_{j=1}^{m} P(xᵢ, yⱼ) ) log₂ P(xᵢ) − ∑_{j=1}^{m} P(yⱼ) ∑_{i=1}^{n} P(xᵢ|yⱼ) log₂ P(xᵢ|yⱼ)

  (distributivity of · over +)  = ∑_{i=1}^{n} ∑_{j=1}^{m} P(xᵢ, yⱼ) log₂ P(xᵢ) − ∑_{j=1}^{m} ∑_{i=1}^{n} P(yⱼ) P(xᵢ|yⱼ) log₂ P(xᵢ|yⱼ)

  (def. of conditional prob.)  = ∑_{i=1}^{n} ∑_{j=1}^{m} P(xᵢ, yⱼ) log₂ P(xᵢ) − ∑_{j=1}^{m} ∑_{i=1}^{n} P(xᵢ, yⱼ) log₂ P(xᵢ|yⱼ)

  (distributivity of · over +)  = ∑_{i=1}^{n} ∑_{j=1}^{m} P(xᵢ, yⱼ) ( log₂ P(xᵢ) − log₂ P(xᵢ|yⱼ) )

  (property of log)  = ∑_{i=1}^{n} ∑_{j=1}^{m} P(xᵢ, yⱼ) log₂ ( P(xᵢ) / P(xᵢ|yⱼ) )

  (multiplication rule)  = ∑_{i=1}^{n} ∑_{j=1}^{m} P(xᵢ|yⱼ) P(yⱼ) log₂ ( P(xᵢ) / P(xᵢ|yⱼ) )

  (distributivity of · over +)  = ∑_{j=1}^{m} P(yⱼ) ∑_{i=1}^{n} P(xᵢ|yⱼ) (=: aᵢ) log₂ ( P(xᵢ) / P(xᵢ|yⱼ) )

Page 85: Foundations of Probabilities

Since on the one hand P(xᵢ|yⱼ) ≥ 0, and on the other hand ∑_{i=1}^{n} P(xᵢ|yⱼ) = 1 for each individual value yⱼ of Y, we can apply Jensen's inequality to the inner sum in the last expression above (more exactly, for each value of the index j separately) and obtain:

−IG(X, Y) ≤ ∑_{j=1}^{m} P(yⱼ) log₂ ( ∑_{i=1}^{n} P(xᵢ|yⱼ) · P(xᵢ) / P(xᵢ|yⱼ) ) = ∑_{j=1}^{m} P(yⱼ) log₂ ( ∑_{i=1}^{n} P(xᵢ) ) (= 1) = 0

Therefore, IG(X, Y) ≥ 0.