Pattern Classification
All materials in these slides were taken from
Pattern Classification (2nd ed) by R. O.
Duda, P. E. Hart and D. G. Stork, John Wiley
& Sons, 2000
with the permission of the authors and the publisher
Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation
l Introduction
l Maximum-Likelihood Estimation
l Bayesian Estimation
l Curse of Dimensionality
l Component analysis & Discriminants
l EM Algorithm
l Introduction
l Bayesian framework
l We could design an optimal classifier if we knew:
l P(ωi) : priors
l P(x | ωi) : class-conditional densities
Unfortunately, we rarely have this complete information!
l Design a classifier based on a set of labeled
training samples (supervised learning)
l Assume priors are known
l Need sufficient no. of training samples for estimating
class-conditional densities, especially when the
dimensionality of the feature space is large
l Assumption about the problem: parametric model of P(x | ωi) is available
l Normality of P(x | ωi)
P(x | ωi) ~ N( µi, Σi)
l Characterized by 2 parameters: the mean vector µi and the covariance matrix Σi
l Estimation techniques
l Maximum-Likelihood (ML) and Bayesian estimation
l Results of the two procedures are nearly identical, but the approaches are different
l Parameters in ML estimation are fixed but unknown!
l The Bayesian parameter estimation procedure, by its nature, utilizes whatever prior information is available about the unknown parameter
l MLE: Best parameters are obtained by maximizing
the probability of obtaining the samples observed
l Bayesian methods view the parameters as random
variables having some known prior distribution; How
do we know the priors?
l In either approach, we use P(ωi | x) for our classification rule!
l Maximum-Likelihood Estimation
l Has good convergence properties as the sample size increases; the estimated parameter value approaches the true value as n increases
l Often simpler than alternative techniques
l General principle
l Assume we have c classes and
P(x | ωj) ~ N( µj, Σj)
P(x | ωj) ≡ P (x | ωj, θj), where
θj = (µj, Σj) = (µj1, µj2, …, σj11², σj22², …, cov(xjm, xjn), …)
Use class ωj samples to estimate class ωj parameters
l Use the information in training samples to estimate
θ = (θ1, θ2, …, θc); θi (i = 1, 2, …, c) is associated with the ith category
l Suppose sample set D contains n iid samples, x1, x2,…, xn
l ML estimate of θ is, by definition, the value that maximizes P(D | θ)
“It is the value of θ that best agrees with the actually observed training samples”
P(D | θ) = ∏k=1..n P(xk | θ) = F(θ)
P(D | θ) is called the likelihood of θ with respect to the set of samples; θ̂ denotes this ML estimate.
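The product form above matters in practice: for even a few hundred samples the product of densities underflows double precision, so one usually works with the log-likelihood instead. A minimal NumPy/SciPy sketch (the data and the candidate θ below are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=1000)   # hypothetical training samples

def likelihood(mu, sigma, data):
    # P(D | theta) = prod_k P(x_k | theta): underflows for large n
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

def log_likelihood(mu, sigma, data):
    # l(theta) = sum_k ln P(x_k | theta): numerically stable
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

print(likelihood(2.0, 1.0, D))      # 0.0 due to floating-point underflow
print(log_likelihood(2.0, 1.0, D))  # finite, usable for comparing candidate thetas
```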
l Optimal estimation
l Let θ = (θ1, θ2, …, θp)t and ∇θ be the gradient operator
l We define l(θ) as the log-likelihood function
l(θ) = ln P(D | θ)
l New problem statement:
determine θ that maximizes the log-likelihood
∇θ = [∂/∂θ1, ∂/∂θ2, …, ∂/∂θp]^t
θ̂ = arg maxθ l(θ)
Set of necessary conditions for an optimum is:
∇θl = 0
∇θ l = ∑k=1..n ∇θ ln P(xk | θ)
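If no closed-form solution of ∇θ l = 0 is available, the log-likelihood can also be maximized numerically. A sketch with scipy.optimize on synthetic 1-D Gaussian data (the log-σ parameterization is a convenience introduced here to keep σ positive, not something from the slides):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=1000)     # hypothetical 1-D samples

def neg_log_likelihood(theta):
    mu, log_sigma = theta                          # optimize log(sigma) so sigma stays > 0
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
print(res.x[0], np.exp(res.x[1]))   # close to the sample mean and ML standard deviation
print(res.jac)                      # gradient ~ 0 at the maximum (the necessary condition)
```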
l Example of a specific case: unknown µ
l P(x | µ) ~ N(µ, Σ)
(Samples are drawn from a multivariate normal population)
θ = µ, therefore the ML estimate for µ must satisfy:
∑k=1..n Σ⁻¹ (xk − µ̂) = 0
where
ln P(xk | µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − µ)^t Σ⁻¹ (xk − µ)
and
∇µ ln P(xk | µ) = Σ⁻¹ (xk − µ)
• Multiplying by Σ and rearranging, we obtain:
µ̂ = (1/n) ∑k=1..n xk
which is the arithmetic average (the sample mean) of the training samples!
Conclusion:
Given P(xk | ωj), j = 1, 2, …, c to be Gaussian in a d-dimensional feature space, estimate the vector θ = (θ1, θ2, …, θc)^t and perform classification using the Bayes decision rule of Chapter 2!
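A minimal NumPy check of this closed-form result on synthetic data (the true mean and covariance below are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=2000)   # n x d sample matrix

mu_hat = X.mean(axis=0)    # ML estimate: the arithmetic average of the samples
print(mu_hat)              # close to true_mu for large n
```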
l ML Estimation:
l Univariate Gaussian Case: unknown µ and σ²
θ = (θ1, θ2) = (µ, σ²)
l = ln P(xk | θ) = −(1/2) ln 2πθ2 − (1/(2θ2)) (xk − θ1)²
∇θ l = [∂(ln P(xk | θ))/∂θ1, ∂(ln P(xk | θ))/∂θ2]^t = 0
which gives the two conditions:
(1/θ2) (xk − θ1) = 0
−1/(2θ2) + (xk − θ1)²/(2θ2²) = 0
Summation over the n samples gives:
∑k=1..n (1/σ̂²) (xk − µ̂) = 0          (1)
−∑k=1..n (1/σ̂²) + ∑k=1..n (xk − µ̂)²/σ̂⁴ = 0          (2)
Combining (1) and (2), one obtains:
µ̂ = (1/n) ∑k=1..n xk  ;   σ̂² = (1/n) ∑k=1..n (xk − µ̂)²
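A short sketch of these two ML formulas on synthetic 1-D data (the true parameters are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=3.0, size=1000)     # hypothetical 1-D samples

mu_hat = x.mean()                                 # (1/n) * sum_k x_k
sigma2_hat = np.mean((x - mu_hat)**2)             # (1/n) * sum_k (x_k - mu_hat)^2
print(mu_hat, sigma2_hat)                         # same as np.var(x, ddof=0), i.e. divide by n
```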
l Bias
l The ML estimate for σ² is biased:
E[ (1/n) ∑i (xi − x̄)² ] = ((n − 1)/n) σ² ≠ σ²
l An unbiased estimator for Σ is the sample covariance matrix:
C = (1/(n − 1)) ∑k=1..n (xk − µ̂)(xk − µ̂)^t
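The difference between the two denominators (n versus n − 1) is easy to see numerically; a sketch with an invented 2-D data set:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]], size=50)
n = X.shape[0]
diff = X - X.mean(axis=0)

cov_ml = diff.T @ diff / n         # ML estimate: divides by n (biased)
C = diff.T @ diff / (n - 1)        # sample covariance matrix: divides by n - 1 (unbiased)
print(cov_ml)
print(C)                           # equals np.cov(X, rowvar=False)
```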
l Bayesian Estimation (Bayesian learning approach
for pattern classification problems)
l In MLE, θ was assumed to have a fixed (but unknown) value
l In Bayesian estimation, θ is a random variable
l The computation of posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
l Goal: compute P(ωi | x, D)
Given the training sample set D, Bayes formula
can be written
P(ωi | x, D) = P(x | ωi, D) P(ωi | D) / ∑j=1..c P(x | ωj, D) P(ωj | D)
l To demonstrate the preceding equation, use:
P(x, ωi | D) = P(x | ωi, D) P(ωi | D)
P(x | D) = ∑j=1..c P(x, ωj | D)
P(ωi | D) = P(ωi)   (the training samples provide this!)
Thus:
P(ωi | x, D) = P(x | ωi, D) P(ωi) / ∑j=1..c P(x | ωj, D) P(ωj)
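A sketch of how this rule could be applied once P(x | ωi, Di) has been estimated from each class's own samples; the two 1-D Gaussian classes, their priors, and the plug-in ML estimates below are all invented for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
D1 = rng.normal(0.0, 1.0, size=300)    # hypothetical samples of class w1
D2 = rng.normal(3.0, 1.5, size=300)    # hypothetical samples of class w2
priors = np.array([0.6, 0.4])          # P(w1), P(w2): assumed known

# Plug-in estimates of P(x | w_i, D_i) via ML parameters of each class
params = [(D1.mean(), D1.std()), (D2.mean(), D2.std())]

def posterior(x):
    likes = np.array([norm.pdf(x, mu, sd) for mu, sd in params])
    joint = likes * priors                 # P(x | w_i, D_i) * P(w_i)
    return joint / joint.sum()             # P(w_i | x, D)

print(posterior(1.2))                      # decide for the class with the larger posterior
```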
l Bayesian Parameter Estimation: Gaussian Case
Goal: Estimate θ using the a-posteriori density
P(θ | D)
l The univariate Gaussian case: P(µ | D)
µ is the only unknown parameter
µ0 and σ0 are known!
P(x | µ) ~ N(µ, σ²)
P(µ) ~ N(µ0, σ0²)
l Reproducing density
P(µ | D) = P(D | µ) P(µ) / ∫ P(D | µ) P(µ) dµ          (1)
         = α ∏k=1..n P(xk | µ) P(µ)
P(µ | D) ~ N(µn, σn²)          (2)
The updated parameters of the prior:
µn = ( n σ0² / (n σ0² + σ²) ) µ̂n + ( σ² / (n σ0² + σ²) ) µ0
σn² = σ0² σ² / (n σ0² + σ²)
where µ̂n = (1/n) ∑k=1..n xk is the sample mean.
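A minimal sketch of this closed-form posterior update, assuming σ² is known and using invented prior parameters and data:

```python
import numpy as np

def posterior_params(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) for the mean of a Gaussian with known variance."""
    n = len(x)
    sample_mean = np.mean(x)
    mu_n = (n * sigma0_sq / (n * sigma0_sq + sigma_sq)) * sample_mean \
         + (sigma_sq / (n * sigma0_sq + sigma_sq)) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=1.0, size=20)        # sigma^2 = 1 assumed known
print(posterior_params(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0))
# mu_n moves from mu0 toward the sample mean, and sigma_n^2 shrinks, as n grows
```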
l The univariate case: P(x | D)
l P(µ | D) has been computed
l P(x | D) remains to be computed!
It provides the desired class-conditional density P(x | Dj, ωj).
P(x | Dj, ωj), together with the prior P(ωj) and Bayes formula, gives the Bayesian classification rule:
P(x | D) = ∫ P(x | µ) P(µ | D) dµ , which is Gaussian:
P(x | D) ~ N(µn, σ² + σn²)
maxj P(ωj | x, Dj)  ≡  maxj P(x | ωj, Dj) P(ωj)
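Continuing the same illustrative setup (two hypothetical 1-D classes, known σ², invented priors): the predictive density simply widens the posterior-mean Gaussian by σn², and the rule above picks the class with the larger P(x | ωj, Dj) P(ωj):

```python
import numpy as np
from scipy.stats import norm

def predictive_pdf(x_new, x, mu0, sigma0_sq, sigma_sq):
    # P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)
    n = len(x)
    mu_n = (n * sigma0_sq * np.mean(x) + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return norm.pdf(x_new, loc=mu_n, scale=np.sqrt(sigma_sq + sigma_n_sq))

rng = np.random.default_rng(7)
D1 = rng.normal(0.0, 1.0, size=15)                 # class 1 samples, sigma^2 = 1 known
D2 = rng.normal(3.0, 1.0, size=15)                 # class 2 samples
priors = [0.5, 0.5]

x_new = 1.0
scores = [predictive_pdf(x_new, D, 0.0, 4.0, 1.0) * p for D, p in zip((D1, D2), priors)]
print(1 + int(np.argmax(scores)))                  # class maximizing P(x | w_j, D_j) P(w_j)
```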
l Bayesian Parameter Estimation: General Theory
l P(x | D) computation can be applied to any
situation in which the unknown density can be
parametrized: the basic assumptions are:
l The form of P(x | θ) is assumed known, but the value of
θ is not known exactly
l Our knowledge about θ is assumed to be contained in
a known prior density P(θ)
l The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown density P(x)
The basic problem is:
“Compute the posterior density P(θ | D)”
then “Derive P(x | D)”
Using Bayes formula, we have:
P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ
and, by the independence assumption:
P(D | θ) = ∏k=1..n P(xk | θ)
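When these quantities have no closed form, they can be approximated on a grid of θ values. A toy sketch for a single unknown mean with known σ (grid limits, prior, and data are all invented):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.normal(loc=1.5, scale=1.0, size=30)          # observed samples, sigma known

theta = np.linspace(-5.0, 5.0, 2001)                 # grid over the unknown mean
prior = norm.pdf(theta, loc=0.0, scale=2.0)          # assumed prior P(theta)

# log P(D | theta) = sum_k log P(x_k | theta), evaluated for every grid point
log_lik = norm.logpdf(x[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)

unnorm = np.exp(log_lik - log_lik.max()) * prior              # P(D | theta) P(theta), rescaled
posterior = unnorm / (unnorm.sum() * (theta[1] - theta[0]))   # divide by the integral
print(theta[np.argmax(posterior)])                   # posterior mode, near the sample mean
```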
l Problem of Dimensionality
l Classification problems with a large number of features (hundreds or even thousands) are frequently encountered
l Classification accuracy depends upon the number of features (dimensionality), their discrimination ability, and the amount of training data
l Two-class multivariate normal with the same covariance
matrix
P(error) = (1/√(2π)) ∫r/2..∞ e^(−u²/2) du
where r² = (µ1 − µ2)^t Σ⁻¹ (µ1 − µ2)
lim r→∞ P(error) = 0
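The error probability above is the Gaussian tail Q(r/2), which can be evaluated with the complementary error function. A small sketch (the means and covariance are arbitrary, and equal priors are assumed, which is the setting in which this formula applies):

```python
import math
import numpy as np

def error_for_equal_cov(mu1, mu2, cov):
    # r^2 = (mu1 - mu2)^t Sigma^{-1} (mu1 - mu2)
    diff = np.asarray(mu1, float) - np.asarray(mu2, float)
    r = math.sqrt(diff @ np.linalg.solve(np.asarray(cov, float), diff))
    # (1/sqrt(2*pi)) * integral_{r/2}^inf exp(-u^2/2) du  =  0.5 * erfc(r / (2*sqrt(2)))
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

print(error_for_equal_cov([0.0, 0.0], [2.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]))
```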
l If the features are independent, then:
Σ = diag(σ1², σ2², …, σd²)  and  r² = ∑i=1..d ( (µi1 − µi2) / σi )²   (illustrated in the sketch after this list)
l This shows how each feature contributes to reducing the probability of error. The most useful features are the ones for which the difference between the means is large relative to the standard deviation
l An obvious way to reduce the error further is to introduce new, independent features
l It has frequently been observed in practice that, when the number of training samples is small, the inclusion of additional features beyond a certain point leads to worse rather than better performance: (i) we have the wrong model, or (ii) there are too few training samples to reliably estimate the parameters
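A numeric illustration of the additive form of r² for independent features (the per-feature means and standard deviations below are made up):

```python
import math
import numpy as np

mu1 = np.array([0.0, 1.0, 2.0])          # class 1 feature means
mu2 = np.array([1.0, 1.2, 2.1])          # class 2 feature means
sigma = np.array([1.0, 2.0, 0.5])        # shared per-feature standard deviations

contrib = ((mu1 - mu2) / sigma) ** 2     # each feature's term ((mu_i1 - mu_i2)/sigma_i)^2
r = math.sqrt(contrib.sum())
p_error = 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))   # same tail integral as before

print(contrib)    # adding an independent feature can only increase r^2 ...
print(p_error)    # ... and therefore can only decrease this (theoretical) error
```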
l Component Analysis and Discriminants
l Combine features in order to reduce the dimension of the feature space
l Linear combinations are simple to compute and tractable
l Project high dimensional data onto a lower dimensional space
l Two classical approaches for finding an “optimal” linear transformation:
l PCA (Principal Component Analysis): “the projection that best represents the data in a least-squares sense” (sketched below)
l MDA (Multiple Discriminant Analysis): “the projection that best separates the data in a least-squares sense”
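A minimal PCA sketch: project centered data onto the top-k eigenvectors of its covariance matrix (the 3-D toy data below are invented; this covers only the PCA half, not MDA):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 1.0, 0.1]), size=500)  # n x d toy data

def pca_project(X, k):
    """Project X onto the k principal components (best least-squares representation)."""
    Xc = X - X.mean(axis=0)                     # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]           # sort components by decreasing variance
    W = eigvecs[:, order[:k]]                   # d x k projection matrix
    return Xc @ W

Y = pca_project(X, k=2)                         # 3-D data projected onto a 2-D subspace
print(Y.shape)
```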