Introduction to Machine Learning
Lecture 14

Mehryar Mohri
Courant Institute and Google Research
Density Estimation
Maxent Models
Entropy

Definition: the entropy of a random variable X with probability distribution p(x) = \Pr[X = x] is

  H(X) = -\mathrm{E}[\log p(X)] = -\sum_x p(x) \log p(x).

Properties:
• Measure of the uncertainty of X.
• H(X) \geq 0.
• Maximal for the uniform distribution. For a support of finite size N, by Jensen's inequality:

  H(X) = \mathrm{E}[\log(1/p(X))] \leq \log \mathrm{E}[1/p(X)] = \log N.
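These properties are easy to check numerically; a minimal Python sketch (entropy in nats, distributions chosen arbitrarily):

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]

# The uniform distribution attains the maximum log N (here N = 4).
assert abs(entropy(uniform) - math.log(4)) < 1e-12
assert entropy(skewed) < entropy(uniform)
# A deterministic variable has zero uncertainty.
assert entropy([1.0]) == 0.0
```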
Relative Entropy

Definition: the relative entropy (or Kullback-Leibler divergence) of two distributions p and q is

  D(p \| q) = \mathrm{E}_p\Big[\log \frac{p(X)}{q(X)}\Big] = \sum_x p(x) \log \frac{p(x)}{q(x)},

with the conventions 0 \log \frac{0}{q} = 0 and p \log \frac{p}{0} = \infty.

Properties:
• Asymmetric measure of deviation between two distributions. It is convex in p and q.
• D(p \| q) \geq 0 for all p and q.
• D(p \| q) = 0 iff p = q.
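The definition and its conventions translate directly into code; a small sketch (distributions chosen arbitrarily):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), with 0 log(0/q) = 0."""
    d = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf   # convention: p log(p/0) = +infinity
            d += px * math.log(px / qx)
    return d

p = [0.5, 0.5]
q = [0.9, 0.1]
assert kl_divergence(p, p) == 0.0                    # D(p || p) = 0
assert kl_divergence(p, q) > 0                       # non-negativity
assert kl_divergence(p, q) != kl_divergence(q, p)    # asymmetry
assert kl_divergence([0.5, 0.5], [1.0, 0.0]) == math.inf
```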
Jensen's Inequality

Theorem: let X be a random variable and f a measurable convex function. Then,

  f(\mathrm{E}[X]) \leq \mathrm{E}[f(X)].

Proof:
• For a distribution over a finite set, the property follows directly from the definition of convexity.
• The general case is a consequence of the continuity of convex functions and the density of finite distributions.
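For a finite distribution the inequality can be verified directly, e.g. with the convex function f(x) = x^2 (support and probabilities arbitrary):

```python
# Convex f(x) = x^2 on a finite distribution: f(E[X]) <= E[f(X)].
xs = [0.0, 1.0, 4.0]       # support (arbitrary)
ps = [0.5, 0.3, 0.2]       # probabilities (arbitrary)

EX = sum(p * x for p, x in zip(ps, xs))          # E[X]
Ef = sum(p * x * x for p, x in zip(ps, xs))      # E[f(X)]
assert EX ** 2 <= Ef       # Jensen's inequality for f(x) = x^2
```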
Applications

Non-negativity of the relative entropy: by Jensen's inequality applied to the concave function log,

  -D(p \| q) = \mathrm{E}_p\Big[\log \frac{q(X)}{p(X)}\Big] \leq \log \mathrm{E}_p\Big[\frac{q(X)}{p(X)}\Big] = \log \sum_x q(x) = \log 1 = 0.
Hoeffding's Bounds

Theorem: let X_1, X_2, \ldots, X_m be a sequence of independent random variables taking values in [0, 1] (e.g., Bernoulli trials). Then, for all \epsilon > 0, the following inequalities hold for \bar{X}_m = \frac{1}{m}\sum_{i=1}^m X_i:

  \Pr[\bar{X}_m - \mathrm{E}[\bar{X}_m] \geq \epsilon] \leq e^{-2m\epsilon^2},
  \Pr[\bar{X}_m - \mathrm{E}[\bar{X}_m] \leq -\epsilon] \leq e^{-2m\epsilon^2}.
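The bound can be compared against a simulation; a small sketch with fair Bernoulli trials (m, epsilon, and trial count chosen arbitrarily):

```python
import math
import random

random.seed(0)
m, eps, trials = 100, 0.1, 2000
bound = math.exp(-2 * m * eps ** 2)   # Hoeffding bound e^{-2 m eps^2}

# Fraction of trials in which the empirical mean of m fair Bernoulli
# draws exceeds its expectation 0.5 by at least eps.
exceed = 0
for _ in range(trials):
    mean = sum(random.random() < 0.5 for _ in range(m)) / m
    if mean - 0.5 >= eps:
        exceed += 1

# The observed deviation frequency respects the bound.
assert exceed / trials <= bound
```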
Problem

Data: sample drawn i.i.d. from set X according to some distribution D,

  x_1, \ldots, x_m \in X.

Problem: find a distribution p out of a set P that best estimates D.
Maximum Likelihood

Likelihood: probability of observing the sample under distribution p \in P, which, given the independence assumption, is

  \Pr[x_1, \ldots, x_m] = \prod_{i=1}^m p(x_i).

Principle: select the distribution maximizing the sample probability,

  p^\star = \operatorname{argmax}_{p \in P} \prod_{i=1}^m p(x_i),

or

  p^\star = \operatorname{argmax}_{p \in P} \sum_{i=1}^m \log p(x_i).
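The principle can be checked on a small categorical sample: the empirical frequencies achieve a higher log-likelihood than an alternative distribution on the same support (sample and alternative below are arbitrary):

```python
import math
from collections import Counter

sample = ["a", "a", "b", "a", "c", "a", "b", "a"]

def log_likelihood(p, sample):
    """sum_i log p(x_i) for a distribution given as a dict."""
    return sum(math.log(p[x]) for x in sample)

# Empirical distribution: frequency of each symbol in the sample.
m = len(sample)
p_hat = {x: c / m for x, c in Counter(sample).items()}

# An arbitrary alternative distribution on the same support.
p_alt = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
assert log_likelihood(p_hat, sample) > log_likelihood(p_alt, sample)
```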
Relative Entropy Formulation

Empirical distribution: the distribution \hat{p} that assigns to each point the frequency of its occurrence in the sample.

Lemma: p^\star has maximum likelihood iff p^\star = \operatorname{argmin}_{p \in P} D(\hat{p} \| p).

Proof:

  D(\hat{p} \| p) = \sum_{z_i \text{ observed}} \hat{p}(z_i) \log \hat{p}(z_i) - \sum_{z_i \text{ observed}} \hat{p}(z_i) \log p(z_i)
  = -H(\hat{p}) - \sum_{z_i} \frac{\mathrm{count}(z_i)}{m} \log p(z_i)
  = -H(\hat{p}) - \frac{1}{m} \log \prod_{z_i} p(z_i)^{\mathrm{count}(z_i)}
  = -H(\hat{p}) - \frac{1}{m} l(p).

Since H(\hat{p}) does not depend on p, minimizing D(\hat{p} \| p) over P is equivalent to maximizing the log-likelihood l(p), whose maximum is l(p^\star).
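The identity in the proof, D(\hat{p} \| p) = -H(\hat{p}) - l(p)/m, can be verified numerically (sample and distribution p chosen arbitrarily):

```python
import math
from collections import Counter

sample = ["a", "a", "b", "a", "c", "b"]
m = len(sample)
p_hat = {x: c / m for x, c in Counter(sample).items()}
p = {"a": 0.5, "b": 0.3, "c": 0.2}   # arbitrary distribution in P

D = sum(px * math.log(px / p[x]) for x, px in p_hat.items())
H = -sum(px * math.log(px) for px in p_hat.values())
loglik = sum(math.log(p[x]) for x in sample)   # l(p)

# D(p_hat || p) = -H(p_hat) - l(p)/m, so minimizing D maximizes l(p).
assert abs(D - (-H - loglik / m)) < 1e-12
```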
Maximum A Posteriori (MAP)

Principle: select the most likely hypothesis h \in H given the sample, with some prior distribution \Pr[h] over the hypotheses:

  h^\star = \operatorname{argmax}_{h \in H} \Pr[h \mid S]
  = \operatorname{argmax}_{h \in H} \frac{\Pr[S \mid h]\,\Pr[h]}{\Pr[S]}
  = \operatorname{argmax}_{h \in H} \Pr[S \mid h]\,\Pr[h].

Note: for a uniform prior, MAP coincides with maximum likelihood.
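The note above can be seen on a two-hypothesis toy example (all likelihood and prior values are hypothetical):

```python
# Hypothetical hypotheses with likelihoods Pr[S|h] and priors Pr[h].
lik = {"h1": 0.02, "h2": 0.05}
prior = {"h1": 0.5, "h2": 0.5}   # uniform prior

map_h = max(lik, key=lambda h: lik[h] * prior[h])
ml_h = max(lik, key=lambda h: lik[h])
assert map_h == ml_h             # uniform prior: MAP = maximum likelihood

# A non-uniform prior can change the selected hypothesis.
prior = {"h1": 0.9, "h2": 0.1}
assert max(lik, key=lambda h: lik[h] * prior[h]) == "h1"
```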
Problem

Data: sample drawn i.i.d. from set X according to some distribution D,

  x_1, \ldots, x_m \in X.

Features: mappings associated to the elements of X,

  \phi_1, \ldots, \phi_N : X \to \mathbb{R}.

Problem: how do we estimate the distribution D? With the uniform distribution u over X?
Features
Examples: statistical language modeling.
• n-grams, distance-d n-grams.
• class-based n-grams, word triggers.
• sentence length.
• number and type of verbs.
• various grammatical information (e.g., agreement, POS tags).
• dialog-level information.
Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)

For large m, we can give a fairly good estimate of the expected value of each feature (Hoeffding's inequality):

  \forall j \in [1, N], \quad \mathrm{E}_{x \sim D}[\phi_j(x)] \approx \frac{1}{m}\sum_{i=1}^m \phi_j(x_i).

Find the distribution that is closest to the uniform distribution u and that preserves the expected values of the features.

Closeness is measured using the relative entropy (or Kullback-Leibler divergence).
Maximum Entropy Formulation

Distributions: let P denote the set of distributions

  P = \Big\{ p : \mathrm{E}_{x \sim p}[\phi_j(x)] = \frac{1}{m}\sum_{i=1}^m \phi_j(x_i), \ j \in [1, N] \Big\}.

Optimization problem: find the distribution p^\star verifying

  p^\star = \operatorname{argmin}_{p \in P} D(p \| u).
Relation with Entropy Maximization

Relationship with entropy:

  D(p \| u) = \sum_{x \in X} p(x) \log \frac{p(x)}{1/|X|}
  = \log |X| + \sum_{x \in X} p(x) \log p(x)
  = \log |X| - H(p).

Optimization problem (a convex problem):

  minimize \sum_{x \in X} p(x) \log p(x)
  subject to \sum_{x \in X} p(x) = 1
             \sum_{x \in X} p(x)\phi_j(x) = \frac{1}{m}\sum_{i=1}^m \phi_j(x_i), \ \forall j \in [1, N].
Maximum Likelihood Gibbs Distributions

Gibbs distributions: the set Q of distributions defined by

  p(x) = \frac{1}{Z} \exp(w \cdot \Phi(x)) = \frac{1}{Z} \exp\Big(\sum_{j=1}^N w_j \Phi_j(x)\Big), \quad \text{with } Z = \sum_x \exp(w \cdot \Phi(x)).

Maximum likelihood with Gibbs distributions:

  p^\star = \operatorname{argmax}_{q \in \bar{Q}} \sum_{i=1}^m \log q(x_i) = \operatorname{argmin}_{q \in \bar{Q}} D(\hat{p} \| q),

where \bar{Q} is the closure of Q.
Duality Theorem (Della Pietra et al., 1997)

Theorem: assume that D(\hat{p} \| u) < \infty. Then, there exists a unique probability distribution p^\star satisfying:
1. p^\star \in P \cap \bar{Q};
2. D(p \| q) = D(p \| p^\star) + D(p^\star \| q) for any p \in P and q \in \bar{Q} (Pythagorean equality);
3. p^\star = \operatorname{argmin}_{q \in \bar{Q}} D(\hat{p} \| q) (maximum likelihood);
4. p^\star = \operatorname{argmin}_{p \in P} D(p \| u) (maximum entropy).

Each of these properties determines p^\star uniquely.
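The duality can be illustrated on a tiny example: with a single feature \phi(x) = x on X = \{1, 2, 3\}, the maxent solution has the Gibbs form p_w(x) = \exp(wx)/Z, and the moment constraint pins down w. The sketch below (all numbers illustrative) finds w by bisection, using the fact that the mean of p_w is increasing in w:

```python
import math

X = [1, 2, 3]
target_mean = 2.4   # hypothetical empirical average of phi(x) = x

def gibbs(w):
    """Gibbs distribution p_w(x) = exp(w x) / Z on X."""
    weights = [math.exp(w * x) for x in X]
    Z = sum(weights)
    return [wt / Z for wt in weights]

def mean(p):
    return sum(px * x for px, x in zip(p, X))

# mean(gibbs(w)) is increasing in w, so bisection solves the constraint.
lo, hi = -10.0, 10.0
for _ in range(100):
    w = (lo + hi) / 2
    if mean(gibbs(w)) < target_mean:
        lo = w
    else:
        hi = w

p_star = gibbs(w)
assert abs(mean(p_star) - target_mean) < 1e-6   # constraint satisfied
assert abs(sum(p_star) - 1.0) < 1e-12           # valid distribution
```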
Regularization

Overfitting:
• Features with low counts;
• Very large feature weights.

Approximation parameters: \beta_j \geq 0, j \in [1, N]:

  \forall j \in [1, N], \quad \Big| \mathrm{E}_{x \sim D}[\phi_j(x)] - \frac{1}{m}\sum_{i=1}^m \phi_j(x_i) \Big| \leq \beta_j.
Regularized Maxent

Optimization problem:

  \operatorname{argmin}_w \ \lambda \|w\|^2 - \frac{1}{m}\sum_{i=1}^m \log p_w[x_i], \quad \text{with } p_w[x] = \frac{1}{Z} \exp(w \cdot \Phi(x)).

Regularization: L1 or L2 norm, also interpreted as Laplacian or Gaussian priors, respectively (Bayesian view). E.g., with a Gaussian prior

  \Pr[w] = \prod_{j=1}^N \frac{e^{-w_j^2/(2\sigma_j^2)}}{\sqrt{2\pi\sigma_j^2}},

the objective becomes

  \tilde{p}(x) \leftarrow p_w(x)\Pr[w],
  \tilde{l}(w) \leftarrow l(w) - \sum_{j=1}^N \frac{w_j^2}{2\sigma_j^2} - \frac{1}{2}\sum_{j=1}^N \log(2\pi\sigma_j^2).
Extensions: Bregman Divergences

Definition: let F be a convex and differentiable function; the Bregman divergence based on F is defined as

  B_F(y, x) = F(y) - F(x) - (y - x) \cdot \nabla_x F(x).

(Figure: B_F(y, x) is the gap at y between F(y) and the tangent to F at x, F(x) + (y - x) \cdot \nabla_x F(x).)

Examples:
• Unnormalized relative entropy.
• Squared Euclidean distance.
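Both examples can be checked numerically from the definition; a sketch (vectors chosen arbitrarily):

```python
import math

def bregman(F, gradF, y, x):
    """B_F(y, x) = F(y) - F(x) - (y - x) . gradF(x)."""
    return F(y) - F(x) - sum(
        (yi - xi) * gi for yi, xi, gi in zip(y, x, gradF(x)))

y, x = [1.0, 2.0], [0.5, 3.0]

# F(v) = sum v_i^2  =>  B_F(y, x) = ||y - x||^2 (squared Euclidean distance).
F_sq = lambda v: sum(vi * vi for vi in v)
grad_sq = lambda v: [2 * vi for vi in v]
sq_dist = sum((a - b) ** 2 for a, b in zip(y, x))
assert abs(bregman(F_sq, grad_sq, y, x) - sq_dist) < 1e-12

# F(v) = sum v_i log v_i - v_i  =>  unnormalized relative entropy:
# B_F(y, x) = sum y_i log(y_i/x_i) - y_i + x_i.
F_ent = lambda v: sum(vi * math.log(vi) - vi for vi in v)
grad_ent = lambda v: [math.log(vi) for vi in v]
ure = sum(a * math.log(a / b) - a + b for a, b in zip(y, x))
assert abs(bregman(F_ent, grad_ent, y, x) - ure) < 1e-12
```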
Conditional Maxent Models
Generalization of Maxent models:
• multi-class classification.
• conditional probability modeling each class.
• different features for each class.
• logistic regression: special case of two classes.
• also known as multinomial logistic regression.
Problem

Data: sample S drawn i.i.d. according to some distribution D,

  S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{1, \ldots, k\})^m.

Multi-label case:

  S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, +1\}^k)^m.

Features: for any y \in \{1, \ldots, k\}, a feature vector x \mapsto \Phi(x, y).

Problem: find a distribution p out of a set P that best estimates the conditional probabilities \Pr[y \mid x], for all classes y \in \{1, \ldots, k\}.
Conditional Maxent

Optimization problem: maximizing the conditional entropy.

  minimize \sum_{y \in [1,k]} \sum_{x \in X} p(y|x) \log p(y|x)
  subject to \forall x \in X, \ \sum_{y \in [1,k]} p(y|x) = 1
             \forall j \in [1, N], \ \sum_{x \in X} \sum_{y \in [1,k]} p(y|x)\,\Phi_j(x, y) = \frac{1}{m}\sum_{i=1}^m \Phi_j(x_i, y_i).
Equivalent ML Formulation

Optimization problem: maximize the conditional log-likelihood.

  \max_{w \in H} \ \frac{1}{m}\sum_{i=1}^m \log p_w[y_i | x_i]
  subject to: \forall (x, y) \in X \times [1, k],
  p_w[y|x] = \frac{1}{Z(x)} \exp(w \cdot \Phi(x, y)), \quad Z(x) = \sum_{y \in [1,k]} \exp(w \cdot \Phi(x, y)).
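The conditional model p_w[y|x] = \exp(w \cdot \Phi(x, y))/Z(x) is a softmax over classes; a minimal sketch (feature vectors and weights below are illustrative):

```python
import math

def conditional_gibbs(w, feats):
    """p_w[y|x] = exp(w . Phi(x,y)) / Z(x); feats[y] plays Phi(x, y)."""
    scores = {y: math.exp(sum(wi * fi for wi, fi in zip(w, f)))
              for y, f in feats.items()}
    Z = sum(scores.values())          # Z(x) normalizes over the classes
    return {y: s / Z for y, s in scores.items()}

# Hypothetical feature vectors Phi(x, y) for three classes at one point x.
feats = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.5, 0.5]}
p = conditional_gibbs([2.0, -1.0], feats)

assert abs(sum(p.values()) - 1.0) < 1e-12   # valid conditional distribution
assert max(p, key=p.get) == 1               # class with largest w . Phi(x, y) wins
```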
Regularized Conditional Maxent

Optimization problem: regularized conditional maximum likelihood, with regularization parameter \lambda \geq 0.

  \min_{w \in H} \ \lambda \|w\|^2 - \frac{1}{m}\sum_{i=1}^m \log p_w[y_i | x_i]
  subject to: \forall (x, y) \in X \times [1, k],
  p_w[y|x] = \frac{1}{Z(x)} \exp(w \cdot \Phi(x, y)), \quad Z(x) = \sum_{y \in [1,k]} \exp(w \cdot \Phi(x, y)).

Different norms can be used for regularization.
Logistic Regression (Berkson, 1944)

Logistic model: the special case k = 2 of

  \Pr[y | x] = \frac{1}{Z(x)} e^{w \cdot \Phi(x, y)}, \quad \text{with } Z(x) = \sum_{y \in [1,k]} e^{w \cdot \Phi(x, y)}.

Properties:
• linear log-odds ratio (or logit):
  \log \frac{\Pr[y_2 | x]}{\Pr[y_1 | x]} = w \cdot (\Phi(x, y_2) - \Phi(x, y_1)),
  equivalently \Pr[+1 | x] = \frac{1}{1 + e^{-w \cdot (\Phi(x,+1) - \Phi(x,-1))}}.
• decision rule: sign of the log-odds ratio.
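A minimal sketch of the regularized conditional maxent objective for k = 2, minimized by plain gradient descent on a toy one-dimensional dataset (the data, \lambda, and step size are illustrative, not from the slides):

```python
import math

# Toy separable data: psi = Phi(x,+1) - Phi(x,-1) is a scalar feature
# difference, labels y in {-1, +1}. Values are illustrative only.
data = [(-2.0, -1), (-1.0, -1), (-0.5, -1), (0.5, 1), (1.0, 1), (2.0, 1)]
lam, m = 0.1, len(data)

def objective(w):
    # lam*w^2 - (1/m) sum_i log p_w[y_i|x_i],
    # with p_w[y|x] = 1 / (1 + exp(-y * w * psi)).
    return lam * w * w + sum(
        math.log(1 + math.exp(-y * w * psi)) for psi, y in data) / m

# Gradient descent on the convex objective.
w, lr = 0.0, 0.5
for _ in range(500):
    grad = 2 * lam * w + sum(
        -y * psi / (1 + math.exp(y * w * psi)) for psi, y in data) / m
    w -= lr * grad

assert objective(w) < objective(0.0)   # descent improved the objective
# Decision rule: sign of the log-odds ratio w * psi(x).
assert all((w * psi > 0) == (y > 0) for psi, y in data)
```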
References
• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
• Joseph Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357-365, 1944.
• Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1:205-237, 1984.
• Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3):1409-1413, 1989.
• J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470-1480, 1972.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, April 1997.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, CMU, 2001.
• E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.
• E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
• Joseph A. O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves, and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10:187-228, 1996.