Introduction to Machine Learning
Lecture 14

Mehryar Mohri
Courant Institute and Google Research
Density Estimation
Maxent Models
Entropy

Definition: the entropy of a random variable X with probability distribution p(x) = \Pr[X = x] is

  H(X) = -\mathrm{E}[\log p(X)] = -\sum_x p(x) \log p(x).

Properties:
• Measure of the uncertainty of X.
• H(X) \geq 0.
• Maximal for the uniform distribution. For a support of finite size N, by Jensen's inequality:

  H(X) = \mathrm{E}[\log(1/p(X))] \leq \log \mathrm{E}[1/p(X)] = \log N.
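These properties are easy to check numerically; a minimal Python sketch (entropy in nats, distributions chosen arbitrarily):

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]

# The uniform distribution attains the maximum log N (here N = 4).
assert abs(entropy(uniform) - math.log(4)) < 1e-12
assert entropy(skewed) < entropy(uniform)
# A deterministic variable has zero uncertainty.
assert entropy([1.0]) == 0.0
```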
Relative Entropy

Definition: the relative entropy (or Kullback-Leibler divergence) of two distributions p and q is

  D(p \| q) = \mathrm{E}_p\Big[\log \frac{p(X)}{q(X)}\Big] = \sum_x p(x) \log \frac{p(x)}{q(x)},

with the conventions 0 \log \frac{0}{q} = 0 and p \log \frac{p}{0} = \infty.

Properties:
• Asymmetric measure of deviation between two distributions. It is convex in p and q.
• D(p \| q) \geq 0 for all p and q.
• D(p \| q) = 0 iff p = q.
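The definition and its conventions translate directly into code; a small sketch (distributions chosen arbitrarily):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), with 0 log(0/q) = 0."""
    d = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf   # convention: p log(p/0) = +infinity
            d += px * math.log(px / qx)
    return d

p = [0.5, 0.5]
q = [0.9, 0.1]
assert kl_divergence(p, p) == 0.0                    # D(p || p) = 0
assert kl_divergence(p, q) > 0                       # non-negativity
assert kl_divergence(p, q) != kl_divergence(q, p)    # asymmetry
assert kl_divergence([0.5, 0.5], [1.0, 0.0]) == math.inf
```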
Jensen's Inequality

Theorem: let X be a random variable and f a measurable convex function. Then,

  f(\mathrm{E}[X]) \leq \mathrm{E}[f(X)].

Proof:
• For a distribution over a finite set, the property follows directly from the definition of convexity.
• The general case is a consequence of the continuity of convex functions and the density of finite distributions.
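For a finite distribution the inequality can be verified directly, e.g. with the convex function f(x) = x^2 (support and probabilities arbitrary):

```python
# Convex f(x) = x^2 on a finite distribution: f(E[X]) <= E[f(X)].
xs = [0.0, 1.0, 4.0]       # support (arbitrary)
ps = [0.5, 0.3, 0.2]       # probabilities (arbitrary)

EX = sum(p * x for p, x in zip(ps, xs))          # E[X]
Ef = sum(p * x * x for p, x in zip(ps, xs))      # E[f(X)]
assert EX ** 2 <= Ef       # Jensen's inequality for f(x) = x^2
```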
Applications

Non-negativity of the relative entropy: by Jensen's inequality applied to the concave function log,

  -D(p \| q) = \mathrm{E}_p\Big[\log \frac{q(X)}{p(X)}\Big] \leq \log \mathrm{E}_p\Big[\frac{q(X)}{p(X)}\Big] = \log \sum_x q(x) = \log 1 = 0.
Hoeffding's Bounds

Theorem: let X_1, X_2, \ldots, X_m be a sequence of independent random variables taking values in [0, 1] (e.g., Bernoulli trials). Then, for all \epsilon > 0, the following inequalities hold for \bar{X}_m = \frac{1}{m}\sum_{i=1}^m X_i:

  \Pr[\bar{X}_m - \mathrm{E}[\bar{X}_m] \geq \epsilon] \leq e^{-2m\epsilon^2},
  \Pr[\bar{X}_m - \mathrm{E}[\bar{X}_m] \leq -\epsilon] \leq e^{-2m\epsilon^2}.
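The bound can be compared against a simulation; a small sketch with fair Bernoulli trials (m, epsilon, and trial count chosen arbitrarily):

```python
import math
import random

random.seed(0)
m, eps, trials = 100, 0.1, 2000
bound = math.exp(-2 * m * eps ** 2)   # Hoeffding bound e^{-2 m eps^2}

# Fraction of trials in which the empirical mean of m fair Bernoulli
# draws exceeds its expectation 0.5 by at least eps.
exceed = 0
for _ in range(trials):
    mean = sum(random.random() < 0.5 for _ in range(m)) / m
    if mean - 0.5 >= eps:
        exceed += 1

# The observed deviation frequency respects the bound.
assert exceed / trials <= bound
```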
Problem

Data: sample drawn i.i.d. from set X according to some distribution D,

  x_1, \ldots, x_m \in X.

Problem: find a distribution p out of a set P that best estimates D.
Maximum Likelihood

Likelihood: probability of observing the sample under distribution p \in P, which, given the independence assumption, is

  \Pr[x_1, \ldots, x_m] = \prod_{i=1}^m p(x_i).

Principle: select the distribution maximizing the sample probability,

  p^\star = \operatorname{argmax}_{p \in P} \prod_{i=1}^m p(x_i),

or

  p^\star = \operatorname{argmax}_{p \in P} \sum_{i=1}^m \log p(x_i).
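The principle can be checked on a small categorical sample: the empirical frequencies achieve a higher log-likelihood than an alternative distribution on the same support (sample and alternative below are arbitrary):

```python
import math
from collections import Counter

sample = ["a", "a", "b", "a", "c", "a", "b", "a"]

def log_likelihood(p, sample):
    """sum_i log p(x_i) for a distribution given as a dict."""
    return sum(math.log(p[x]) for x in sample)

# Empirical distribution: frequency of each symbol in the sample.
m = len(sample)
p_hat = {x: c / m for x, c in Counter(sample).items()}

# An arbitrary alternative distribution on the same support.
p_alt = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
assert log_likelihood(p_hat, sample) > log_likelihood(p_alt, sample)
```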
Relative Entropy Formulation

Empirical distribution: the distribution \hat{p} that assigns to each point the frequency of its occurrence in the sample.

Lemma: p^\star has maximum likelihood iff p^\star = \operatorname{argmin}_{p \in P} D(\hat{p} \| p).

Proof:

  D(\hat{p} \| p) = \sum_{z_i \text{ observed}} \hat{p}(z_i) \log \hat{p}(z_i) - \sum_{z_i \text{ observed}} \hat{p}(z_i) \log p(z_i)
  = -H(\hat{p}) - \sum_{z_i} \frac{\mathrm{count}(z_i)}{m} \log p(z_i)
  = -H(\hat{p}) - \frac{1}{m} \log \prod_{z_i} p(z_i)^{\mathrm{count}(z_i)}
  = -H(\hat{p}) - \frac{1}{m} l(p).

Since H(\hat{p}) does not depend on p, minimizing D(\hat{p} \| p) over P is equivalent to maximizing the log-likelihood l(p), whose maximum is l(p^\star).
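The identity in the proof, D(\hat{p} \| p) = -H(\hat{p}) - l(p)/m, can be verified numerically (sample and distribution p chosen arbitrarily):

```python
import math
from collections import Counter

sample = ["a", "a", "b", "a", "c", "b"]
m = len(sample)
p_hat = {x: c / m for x, c in Counter(sample).items()}
p = {"a": 0.5, "b": 0.3, "c": 0.2}   # arbitrary distribution in P

D = sum(px * math.log(px / p[x]) for x, px in p_hat.items())
H = -sum(px * math.log(px) for px in p_hat.values())
loglik = sum(math.log(p[x]) for x in sample)   # l(p)

# D(p_hat || p) = -H(p_hat) - l(p)/m, so minimizing D maximizes l(p).
assert abs(D - (-H - loglik / m)) < 1e-12
```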
Maximum A Posteriori (MAP)

Principle: select the most likely hypothesis h \in H given the sample, with some prior distribution \Pr[h] over the hypotheses:

  h^\star = \operatorname{argmax}_{h \in H} \Pr[h \mid S]
  = \operatorname{argmax}_{h \in H} \frac{\Pr[S \mid h]\,\Pr[h]}{\Pr[S]}
  = \operatorname{argmax}_{h \in H} \Pr[S \mid h]\,\Pr[h].

Note: for a uniform prior, MAP coincides with maximum likelihood.
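The note above can be seen on a two-hypothesis toy example (all likelihood and prior values are hypothetical):

```python
# Hypothetical hypotheses with likelihoods Pr[S|h] and priors Pr[h].
lik = {"h1": 0.02, "h2": 0.05}
prior = {"h1": 0.5, "h2": 0.5}   # uniform prior

map_h = max(lik, key=lambda h: lik[h] * prior[h])
ml_h = max(lik, key=lambda h: lik[h])
assert map_h == ml_h             # uniform prior: MAP = maximum likelihood

# A non-uniform prior can change the selected hypothesis.
prior = {"h1": 0.9, "h2": 0.1}
assert max(lik, key=lambda h: lik[h] * prior[h]) == "h1"
```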
Problem

Data: sample drawn i.i.d. from set X according to some distribution D,

  x_1, \ldots, x_m \in X.

Features: mappings associated to the elements of X,

  \phi_1, \ldots, \phi_N : X \to \mathbb{R}.

Problem: how do we estimate the distribution D? With the uniform distribution u over X?
Features
Examples: statistical language modeling.
• n-grams, distance-d n-grams.
• class-based n-grams, word triggers.
• sentence length.
• number and type of verbs.
• various grammatical information (e.g., agreement, POS tags).
• dialog-level information.
Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)

For large m, we can give a fairly good estimate of the expected value of each feature (Hoeffding's inequality):

  \forall j \in [1, N], \quad \mathrm{E}_{x \sim D}[\phi_j(x)] \approx \frac{1}{m}\sum_{i=1}^m \phi_j(x_i).

Find the distribution that is closest to the uniform distribution u and that preserves the expected values of the features.

Closeness is measured using the relative entropy (or Kullback-Leibler divergence).
Maximum Entropy Formulation

Distributions: let P denote the set of distributions

  P = \Big\{ p : \mathrm{E}_{x \sim p}[\phi_j(x)] = \frac{1}{m}\sum_{i=1}^m \phi_j(x_i), \ j \in [1, N] \Big\}.

Optimization problem: find the distribution p^\star verifying

  p^\star = \operatorname{argmin}_{p \in P} D(p \| u).
Relation with Entropy Maximization

Relationship with entropy:

  D(p \| u) = \sum_{x \in X} p(x) \log \frac{p(x)}{1/|X|}
  = \log |X| + \sum_{x \in X} p(x) \log p(x)
  = \log |X| - H(p).

Optimization problem (a convex problem):

  minimize \sum_{x \in X} p(x) \log p(x)
  subject to \sum_{x \in X} p(x) = 1
             \sum_{x \in X} p(x)\phi_j(x) = \frac{1}{m}\sum_{i=1}^m \phi_j(x_i), \ \forall j \in [1, N].
Maximum Likelihood Gibbs Distributions

Gibbs distributions: the set Q of distributions defined by

  p(x) = \frac{1}{Z} \exp(w \cdot \Phi(x)) = \frac{1}{Z} \exp\Big(\sum_{j=1}^N w_j \Phi_j(x)\Big), \quad \text{with } Z = \sum_x \exp(w \cdot \Phi(x)).

Maximum likelihood with Gibbs distributions:

  p^\star = \operatorname{argmax}_{q \in \bar{Q}} \sum_{i=1}^m \log q(x_i) = \operatorname{argmin}_{q \in \bar{Q}} D(\hat{p} \| q),

where \bar{Q} is the closure of Q.
Duality Theorem (Della Pietra et al., 1997)

Theorem: assume that D(\hat{p} \| u) < \infty. Then, there exists a unique probability distribution p^\star satisfying:
1. p^\star \in P \cap \bar{Q};
2. D(p \| q) = D(p \| p^\star) + D(p^\star \| q) for any p \in P and q \in \bar{Q} (Pythagorean equality);
3. p^\star = \operatorname{argmin}_{q \in \bar{Q}} D(\hat{p} \| q) (maximum likelihood);
4. p^\star = \operatorname{argmin}_{p \in P} D(p \| u) (maximum entropy).

Each of these properties determines p^\star uniquely.
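The duality can be illustrated on a tiny example: with a single feature \phi(x) = x on X = \{1, 2, 3\}, the maxent solution has the Gibbs form p_w(x) = \exp(wx)/Z, and the moment constraint pins down w. The sketch below (all numbers illustrative) finds w by bisection, using the fact that the mean of p_w is increasing in w:

```python
import math

X = [1, 2, 3]
target_mean = 2.4   # hypothetical empirical average of phi(x) = x

def gibbs(w):
    """Gibbs distribution p_w(x) = exp(w x) / Z on X."""
    weights = [math.exp(w * x) for x in X]
    Z = sum(weights)
    return [wt / Z for wt in weights]

def mean(p):
    return sum(px * x for px, x in zip(p, X))

# mean(gibbs(w)) is increasing in w, so bisection solves the constraint.
lo, hi = -10.0, 10.0
for _ in range(100):
    w = (lo + hi) / 2
    if mean(gibbs(w)) < target_mean:
        lo = w
    else:
        hi = w

p_star = gibbs(w)
assert abs(mean(p_star) - target_mean) < 1e-6   # constraint satisfied
assert abs(sum(p_star) - 1.0) < 1e-12           # valid distribution
```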
Regularization

Overfitting:
• Features with low counts;
• Very large feature weights.

Approximation parameters: \beta_j \geq 0, j \in [1, N]:

  \forall j \in [1, N], \quad \Big| \mathrm{E}_{x \sim D}[\phi_j(x)] - \frac{1}{m}\sum_{i=1}^m \phi_j(x_i) \Big| \leq \beta_j.
Regularized Maxent

Optimization problem:

  \operatorname{argmin}_w \ \lambda \|w\|^2 - \frac{1}{m}\sum_{i=1}^m \log p_w[x_i], \quad \text{with } p_w[x] = \frac{1}{Z} \exp(w \cdot \Phi(x)).

Regularization: L1 or L2 norm, also interpreted as Laplacian or Gaussian priors, respectively (Bayesian view). E.g., with a Gaussian prior

  \Pr[w] = \prod_{j=1}^N \frac{e^{-w_j^2/(2\sigma_j^2)}}{\sqrt{2\pi\sigma_j^2}},

the objective becomes

  \tilde{p}(x) \leftarrow p_w(x)\Pr[w],
  \tilde{l}(w) \leftarrow l(w) - \sum_{j=1}^N \frac{w_j^2}{2\sigma_j^2} - \frac{1}{2}\sum_{j=1}^N \log(2\pi\sigma_j^2).
Extensions: Bregman Divergences

Definition: let F be a convex and differentiable function; the Bregman divergence based on F is defined as

  B_F(y, x) = F(y) - F(x) - (y - x) \cdot \nabla_x F(x).

(Figure: B_F(y, x) is the gap at y between F(y) and the tangent to F at x, F(x) + (y - x) \cdot \nabla_x F(x).)

Examples:
• Unnormalized relative entropy.
• Squared Euclidean distance.
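Both examples can be checked numerically from the definition; a sketch (vectors chosen arbitrarily):

```python
import math

def bregman(F, gradF, y, x):
    """B_F(y, x) = F(y) - F(x) - (y - x) . gradF(x)."""
    return F(y) - F(x) - sum(
        (yi - xi) * gi for yi, xi, gi in zip(y, x, gradF(x)))

y, x = [1.0, 2.0], [0.5, 3.0]

# F(v) = sum v_i^2  =>  B_F(y, x) = ||y - x||^2 (squared Euclidean distance).
F_sq = lambda v: sum(vi * vi for vi in v)
grad_sq = lambda v: [2 * vi for vi in v]
sq_dist = sum((a - b) ** 2 for a, b in zip(y, x))
assert abs(bregman(F_sq, grad_sq, y, x) - sq_dist) < 1e-12

# F(v) = sum v_i log v_i - v_i  =>  unnormalized relative entropy:
# B_F(y, x) = sum y_i log(y_i/x_i) - y_i + x_i.
F_ent = lambda v: sum(vi * math.log(vi) - vi for vi in v)
grad_ent = lambda v: [math.log(vi) for vi in v]
ure = sum(a * math.log(a / b) - a + b for a, b in zip(y, x))
assert abs(bregman(F_ent, grad_ent, y, x) - ure) < 1e-12
```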
Conditional Maxent Models
Generalization of Maxent models:
• multi-class classification.
• conditional probability modeling each class.
• different features for each class.
• logistic regression: special case of two classes.
• also known as multinomial logistic regression.
Problem

Data: sample S drawn i.i.d. according to some distribution D,

  S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{1, \ldots, k\})^m.

Multi-label case:

  S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, +1\}^k)^m.

Features: for any y \in \{1, \ldots, k\}, a feature vector x \mapsto \Phi(x, y).

Problem: find a distribution p out of a set P that best estimates the conditional probabilities \Pr[y \mid x], for all classes y \in \{1, \ldots, k\}.
Conditional Maxent

Optimization problem: maximizing the conditional entropy.

  minimize \sum_{y \in [1,k]} \sum_{x \in X} p(y|x) \log p(y|x)
  subject to \forall x \in X, \ \sum_{y \in [1,k]} p(y|x) = 1
             \forall j \in [1, N], \ \sum_{x \in X} \sum_{y \in [1,k]} p(y|x)\,\Phi_j(x, y) = \frac{1}{m}\sum_{i=1}^m \Phi_j(x_i, y_i).
Equivalent ML Formulation

Optimization problem: maximize the conditional log-likelihood.

  \max_{w \in H} \ \frac{1}{m}\sum_{i=1}^m \log p_w[y_i | x_i]
  subject to: \forall (x, y) \in X \times [1, k],
  p_w[y|x] = \frac{1}{Z(x)} \exp(w \cdot \Phi(x, y)), \quad Z(x) = \sum_{y \in [1,k]} \exp(w \cdot \Phi(x, y)).
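The conditional model p_w[y|x] = \exp(w \cdot \Phi(x, y))/Z(x) is a softmax over classes; a minimal sketch (feature vectors and weights below are illustrative):

```python
import math

def conditional_gibbs(w, feats):
    """p_w[y|x] = exp(w . Phi(x,y)) / Z(x); feats[y] plays Phi(x, y)."""
    scores = {y: math.exp(sum(wi * fi for wi, fi in zip(w, f)))
              for y, f in feats.items()}
    Z = sum(scores.values())          # Z(x) normalizes over the classes
    return {y: s / Z for y, s in scores.items()}

# Hypothetical feature vectors Phi(x, y) for three classes at one point x.
feats = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.5, 0.5]}
p = conditional_gibbs([2.0, -1.0], feats)

assert abs(sum(p.values()) - 1.0) < 1e-12   # valid conditional distribution
assert max(p, key=p.get) == 1               # class with largest w . Phi(x, y) wins
```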
Regularized Conditional Maxent

Optimization problem: regularized conditional maximum likelihood, with regularization parameter \lambda \geq 0.

  \min_{w \in H} \ \lambda \|w\|^2 - \frac{1}{m}\sum_{i=1}^m \log p_w[y_i | x_i]
  subject to: \forall (x, y) \in X \times [1, k],
  p_w[y|x] = \frac{1}{Z(x)} \exp(w \cdot \Phi(x, y)), \quad Z(x) = \sum_{y \in [1,k]} \exp(w \cdot \Phi(x, y)).

Different norms can be used for regularization.
Logistic Regression (Berkson, 1944)

Logistic model: the special case k = 2 of

  \Pr[y | x] = \frac{1}{Z(x)} e^{w \cdot \Phi(x, y)}, \quad \text{with } Z(x) = \sum_{y \in [1,k]} e^{w \cdot \Phi(x, y)}.

Properties:
• linear log-odds ratio (or logit):
  \log \frac{\Pr[y_2 | x]}{\Pr[y_1 | x]} = w \cdot (\Phi(x, y_2) - \Phi(x, y_1)),
  equivalently \Pr[+1 | x] = \frac{1}{1 + e^{-w \cdot (\Phi(x,+1) - \Phi(x,-1))}}.
• decision rule: sign of the log-odds ratio.
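A minimal sketch of the regularized conditional maxent objective for k = 2, minimized by plain gradient descent on a toy one-dimensional dataset (the data, \lambda, and step size are illustrative, not from the slides):

```python
import math

# Toy separable data: psi = Phi(x,+1) - Phi(x,-1) is a scalar feature
# difference, labels y in {-1, +1}. Values are illustrative only.
data = [(-2.0, -1), (-1.0, -1), (-0.5, -1), (0.5, 1), (1.0, 1), (2.0, 1)]
lam, m = 0.1, len(data)

def objective(w):
    # lam*w^2 - (1/m) sum_i log p_w[y_i|x_i],
    # with p_w[y|x] = 1 / (1 + exp(-y * w * psi)).
    return lam * w * w + sum(
        math.log(1 + math.exp(-y * w * psi)) for psi, y in data) / m

# Gradient descent on the convex objective.
w, lr = 0.0, 0.5
for _ in range(500):
    grad = 2 * lam * w + sum(
        -y * psi / (1 + math.exp(y * w * psi)) for psi, y in data) / m
    w -= lr * grad

assert objective(w) < objective(0.0)   # descent improved the objective
# Decision rule: sign of the log-odds ratio w * psi(x).
assert all((w * psi > 0) == (y > 0) for psi, y in data)
```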
References
• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
• Joseph Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357-365, 1944.
• Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1:205-237, 1984.
• Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3):1409-1413, 1989.
• J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470-1480, 1972.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, April 1997.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, CMU, 2001.
• E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.
• E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
• Joseph A. O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves, and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10:187-228, 1996.