Pattern Classification
All materials in these slides were taken from
Pattern Classification (2nd ed) by R. O.
Duda, P. E. Hart and D. G. Stork, John Wiley
& Sons, 2000
with the permission of the authors and the publisher
Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation
l Introduction
l Maximum-Likelihood Estimation
l Bayesian Estimation
l Curse of Dimensionality
l Component analysis & Discriminants
l EM Algorithm
l Introduction
l Bayesian framework
l We could design an optimal classifier if we knew:
l P(ωi) : priors
l P(x | ωi) : class-conditional densities
Unfortunately, we rarely have this complete information!
l Design a classifier based on a set of labeled
training samples (supervised learning)
l Assume priors are known
l Need sufficient no. of training samples for estimating
class-conditional densities, especially when the
dimensionality of the feature space is large
l Assumption about the problem: parametric model of P(x | ωi) is available
l Normality of P(x | ωi)
P(x | ωi) ~ N( µi, Σi)
l Characterized by 2 parameters: the mean vector µi and the covariance matrix Σi
l Estimation techniques
l Maximum-Likelihood (ML) and Bayesian estimation
l Results of the two procedures are nearly identical, but the approaches are different
l Parameters in ML estimation are fixed but unknown!
l The Bayesian parameter estimation procedure, by its nature, utilizes whatever prior information is available about the unknown parameter
l MLE: Best parameters are obtained by maximizing
the probability of obtaining the samples observed
l Bayesian methods view the parameters as random
variables having some known prior distribution; How
do we know the priors?
l In either approach, we use P(ωi | x) for our classification rule!
l Maximum-Likelihood Estimation
l Has good convergence properties as the sample size increases; the estimated parameter value approaches the true value as n increases
l Often simpler than alternative techniques
l General principle
l Assume we have c classes and
P(x | ωj) ~ N( µj, Σj)
P(x | ωj) ≡ P (x | ωj, θj), where
θj = (µj, Σj) = (µj1, µj2, …, σj11², σj22², …, cov(xjm, xjn), …)
Use class ωj samples to estimate class ωj parameters
l Use the information in training samples to estimate
θ = (θ1, θ2, …, θc); θi (i = 1, 2, …, c) is associated with the ith category
l Suppose sample set D contains n iid samples, x1, x2,…, xn
l ML estimate of θ is, by definition, the value that maximizes P(D | θ)
“It is the value of θ that best agrees with the actually observed training samples”
P(D | θ) = ∏k=1..n P(xk | θ) = F(θ)
P(D | θ) is called the likelihood of θ with respect to the set of samples; θ̂ denotes this ML estimate.
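The product form above matters in practice: for even a few hundred samples the product of densities underflows double precision, so one usually works with the log-likelihood instead. A minimal NumPy/SciPy sketch (the data and the candidate θ below are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=1000)   # hypothetical training samples

def likelihood(mu, sigma, data):
    # P(D | theta) = prod_k P(x_k | theta): underflows for large n
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

def log_likelihood(mu, sigma, data):
    # l(theta) = sum_k ln P(x_k | theta): numerically stable
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

print(likelihood(2.0, 1.0, D))      # 0.0 due to floating-point underflow
print(log_likelihood(2.0, 1.0, D))  # finite, usable for comparing candidate thetas
```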
l Optimal estimation
l Let θ = (θ1, θ2, …, θp)t and ∇θ be the gradient operator
l We define l(θ) as the log-likelihood function
l(θ) = ln P(D | θ)
l New problem statement:
determine θ that maximizes the log-likelihood
∇θ = [∂/∂θ1, ∂/∂θ2, …, ∂/∂θp]^t
θ̂ = arg maxθ l(θ)
Set of necessary conditions for an optimum is:
∇θl = 0
∇θ l = ∑k=1..n ∇θ ln P(xk | θ)
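If no closed-form solution of ∇θ l = 0 is available, the log-likelihood can also be maximized numerically. A sketch with scipy.optimize on synthetic 1-D Gaussian data (the log-σ parameterization is a convenience introduced here to keep σ positive, not something from the slides):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=1000)     # hypothetical 1-D samples

def neg_log_likelihood(theta):
    mu, log_sigma = theta                          # optimize log(sigma) so sigma stays > 0
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
print(res.x[0], np.exp(res.x[1]))   # close to the sample mean and ML standard deviation
print(res.jac)                      # gradient ~ 0 at the maximum (the necessary condition)
```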
l Example of a specific case: unknown µ
l P(x | µ) ~ N(µ, Σ)
(Samples are drawn from a multivariate normal population)
θ = µ, therefore the ML estimate for µ must satisfy:
∑k=1..n Σ⁻¹ (xk − µ̂) = 0
where
ln P(xk | µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − µ)^t Σ⁻¹ (xk − µ)
and
∇µ ln P(xk | µ) = Σ⁻¹ (xk − µ)
• Multiplying by Σ and rearranging, we obtain:
µ̂ = (1/n) ∑k=1..n xk
which is the arithmetic average (the sample mean) of the training samples!
Conclusion:
Given P(xk | ωj), j = 1, 2, …, c to be Gaussian in a d-dimensional feature space, estimate the vector θ = (θ1, θ2, …, θc)^t and perform classification using the Bayes decision rule of Chapter 2!
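A minimal NumPy check of this closed-form result on synthetic data (the true mean and covariance below are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=2000)   # n x d sample matrix

mu_hat = X.mean(axis=0)    # ML estimate: the arithmetic average of the samples
print(mu_hat)              # close to true_mu for large n
```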
l ML Estimation:
l Univariate Gaussian Case: unknown µ and σ²
θ = (θ1, θ2) = (µ, σ²)
l = ln P(xk | θ) = −(1/2) ln 2πθ2 − (1/(2θ2)) (xk − θ1)²
∇θ l = [∂(ln P(xk | θ))/∂θ1, ∂(ln P(xk | θ))/∂θ2]^t = 0
which gives the two conditions:
(1/θ2) (xk − θ1) = 0
−1/(2θ2) + (xk − θ1)²/(2θ2²) = 0
Summation over the n samples gives:
∑k=1..n (1/σ̂²) (xk − µ̂) = 0          (1)
−∑k=1..n (1/σ̂²) + ∑k=1..n (xk − µ̂)²/σ̂⁴ = 0          (2)
Combining (1) and (2), one obtains:
µ̂ = (1/n) ∑k=1..n xk  ;   σ̂² = (1/n) ∑k=1..n (xk − µ̂)²
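A short sketch of these two ML formulas on synthetic 1-D data (the true parameters are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=3.0, size=1000)     # hypothetical 1-D samples

mu_hat = x.mean()                                 # (1/n) * sum_k x_k
sigma2_hat = np.mean((x - mu_hat)**2)             # (1/n) * sum_k (x_k - mu_hat)^2
print(mu_hat, sigma2_hat)                         # same as np.var(x, ddof=0), i.e. divide by n
```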
l Bias
l The ML estimate for σ² is biased:
E[ (1/n) ∑i (xi − x̄)² ] = ((n − 1)/n) σ² ≠ σ²
l An unbiased estimator for Σ is the sample covariance matrix:
C = (1/(n − 1)) ∑k=1..n (xk − µ̂)(xk − µ̂)^t
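The difference between the two denominators (n versus n − 1) is easy to see numerically; a sketch with an invented 2-D data set:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]], size=50)
n = X.shape[0]
diff = X - X.mean(axis=0)

cov_ml = diff.T @ diff / n         # ML estimate: divides by n (biased)
C = diff.T @ diff / (n - 1)        # sample covariance matrix: divides by n - 1 (unbiased)
print(cov_ml)
print(C)                           # equals np.cov(X, rowvar=False)
```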
l Bayesian Estimation (Bayesian learning approach
for pattern classification problems)
l In MLE, θ was assumed to have a fixed (but unknown) value
l In Bayesian estimation, θ is a random variable
l The computation of posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
l Goal: compute P(ωi | x, D)
Given the training sample set D, Bayes formula
can be written
P(ωi | x, D) = P(x | ωi, D) P(ωi | D) / ∑j=1..c P(x | ωj, D) P(ωj | D)
l To demonstrate the preceding equation, use:
P(x, ωi | D) = P(x | ωi, D) P(ωi | D)
P(x | D) = ∑j=1..c P(x, ωj | D)
P(ωi | D) = P(ωi)   (the training samples provide this!)
Thus:
P(ωi | x, D) = P(x | ωi, D) P(ωi) / ∑j=1..c P(x | ωj, D) P(ωj)
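A sketch of how this rule could be applied once P(x | ωi, Di) has been estimated from each class's own samples; the two 1-D Gaussian classes, their priors, and the plug-in ML estimates below are all invented for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
D1 = rng.normal(0.0, 1.0, size=300)    # hypothetical samples of class w1
D2 = rng.normal(3.0, 1.5, size=300)    # hypothetical samples of class w2
priors = np.array([0.6, 0.4])          # P(w1), P(w2): assumed known

# Plug-in estimates of P(x | w_i, D_i) via ML parameters of each class
params = [(D1.mean(), D1.std()), (D2.mean(), D2.std())]

def posterior(x):
    likes = np.array([norm.pdf(x, mu, sd) for mu, sd in params])
    joint = likes * priors                 # P(x | w_i, D_i) * P(w_i)
    return joint / joint.sum()             # P(w_i | x, D)

print(posterior(1.2))                      # decide for the class with the larger posterior
```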
l Bayesian Parameter Estimation: Gaussian Case
Goal: Estimate θ using the a-posteriori density
P(θ | D)
l The univariate Gaussian case: P(µ | D)
µ is the only unknown parameter
µ0 and σ0 are known!
P(x | µ) ~ N(µ, σ²)
P(µ) ~ N(µ0, σ0²)
l Reproducing density
P(µ | D) = P(D | µ) P(µ) / ∫ P(D | µ) P(µ) dµ          (1)
         = α ∏k=1..n P(xk | µ) P(µ)
P(µ | D) ~ N(µn, σn²)          (2)
The updated parameters of the prior:
µn = ( n σ0² / (n σ0² + σ²) ) µ̂n + ( σ² / (n σ0² + σ²) ) µ0
σn² = σ0² σ² / (n σ0² + σ²)
where µ̂n = (1/n) ∑k=1..n xk is the sample mean.
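A minimal sketch of this closed-form posterior update, assuming σ² is known and using invented prior parameters and data:

```python
import numpy as np

def posterior_params(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) for the mean of a Gaussian with known variance."""
    n = len(x)
    sample_mean = np.mean(x)
    mu_n = (n * sigma0_sq / (n * sigma0_sq + sigma_sq)) * sample_mean \
         + (sigma_sq / (n * sigma0_sq + sigma_sq)) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=1.0, size=20)        # sigma^2 = 1 assumed known
print(posterior_params(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0))
# mu_n moves from mu0 toward the sample mean, and sigma_n^2 shrinks, as n grows
```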
l The univariate case: P(x | D)
l P(µ | D) has been computed
l P(x | D) remains to be computed!
It provides the desired class-conditional density P(x | Dj, ωj).
P(x | Dj, ωj), together with the prior P(ωj) and Bayes formula, gives the Bayesian classification rule:
P(x | D) = ∫ P(x | µ) P(µ | D) dµ , which is Gaussian:
P(x | D) ~ N(µn, σ² + σn²)
maxj P(ωj | x, Dj)  ≡  maxj P(x | ωj, Dj) P(ωj)
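Continuing the same illustrative setup (two hypothetical 1-D classes, known σ², invented priors): the predictive density simply widens the posterior-mean Gaussian by σn², and the rule above picks the class with the larger P(x | ωj, Dj) P(ωj):

```python
import numpy as np
from scipy.stats import norm

def predictive_pdf(x_new, x, mu0, sigma0_sq, sigma_sq):
    # P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)
    n = len(x)
    mu_n = (n * sigma0_sq * np.mean(x) + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return norm.pdf(x_new, loc=mu_n, scale=np.sqrt(sigma_sq + sigma_n_sq))

rng = np.random.default_rng(7)
D1 = rng.normal(0.0, 1.0, size=15)                 # class 1 samples, sigma^2 = 1 known
D2 = rng.normal(3.0, 1.0, size=15)                 # class 2 samples
priors = [0.5, 0.5]

x_new = 1.0
scores = [predictive_pdf(x_new, D, 0.0, 4.0, 1.0) * p for D, p in zip((D1, D2), priors)]
print(1 + int(np.argmax(scores)))                  # class maximizing P(x | w_j, D_j) P(w_j)
```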
l Bayesian Parameter Estimation: General Theory
l P(x | D) computation can be applied to any
situation in which the unknown density can be
parametrized: the basic assumptions are:
l The form of P(x | θ) is assumed known, but the value of
θ is not known exactly
l Our knowledge about θ is assumed to be contained in
a known prior density P(θ)
l The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown density P(x)
The basic problem is:
“Compute the posterior density P(θ | D)”
then “Derive P(x | D)”
Using Bayes formula, we have:
P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ
and, by the independence assumption:
P(D | θ) = ∏k=1..n P(xk | θ)
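When these quantities have no closed form, they can be approximated on a grid of θ values. A toy sketch for a single unknown mean with known σ (grid limits, prior, and data are all invented):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.normal(loc=1.5, scale=1.0, size=30)          # observed samples, sigma known

theta = np.linspace(-5.0, 5.0, 2001)                 # grid over the unknown mean
prior = norm.pdf(theta, loc=0.0, scale=2.0)          # assumed prior P(theta)

# log P(D | theta) = sum_k log P(x_k | theta), evaluated for every grid point
log_lik = norm.logpdf(x[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)

unnorm = np.exp(log_lik - log_lik.max()) * prior              # P(D | theta) P(theta), rescaled
posterior = unnorm / (unnorm.sum() * (theta[1] - theta[0]))   # divide by the integral
print(theta[np.argmax(posterior)])                   # posterior mode, near the sample mean
```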
l Problem of Dimensionality
l Classification problems with a large number of features (hundreds or even thousands) are frequently encountered
l Classification accuracy depends upon the number of features (dimensionality), their discrimination ability, and the amount of training data
l Two-class multivariate normal with the same covariance
matrix
P(error) = (1/√(2π)) ∫r/2..∞ e^(−u²/2) du
where r² = (µ1 − µ2)^t Σ⁻¹ (µ1 − µ2)
lim r→∞ P(error) = 0
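The error probability above is the Gaussian tail Q(r/2), which can be evaluated with the complementary error function. A small sketch (the means and covariance are arbitrary, and equal priors are assumed, which is the setting in which this formula applies):

```python
import math
import numpy as np

def error_for_equal_cov(mu1, mu2, cov):
    # r^2 = (mu1 - mu2)^t Sigma^{-1} (mu1 - mu2)
    diff = np.asarray(mu1, float) - np.asarray(mu2, float)
    r = math.sqrt(diff @ np.linalg.solve(np.asarray(cov, float), diff))
    # (1/sqrt(2*pi)) * integral_{r/2}^inf exp(-u^2/2) du  =  0.5 * erfc(r / (2*sqrt(2)))
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

print(error_for_equal_cov([0.0, 0.0], [2.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]))
```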
l If the features are independent, then:
Σ = diag(σ1², σ2², …, σd²)  and  r² = ∑i=1..d ( (µi1 − µi2) / σi )²   (illustrated in the sketch after this list)
l This shows how each feature contributes to reducing the probability of error. The most useful features are the ones for which the difference between the means is large relative to the standard deviation
l An obvious way to reduce the error further is to introduce new, independent features
l It has frequently been observed in practice that, when the number of training samples is small, the inclusion of additional features beyond a certain point leads to worse rather than better performance: (i) we have the wrong model, or (ii) there are too few training samples to reliably estimate the parameters
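A numeric illustration of the additive form of r² for independent features (the per-feature means and standard deviations below are made up):

```python
import math
import numpy as np

mu1 = np.array([0.0, 1.0, 2.0])          # class 1 feature means
mu2 = np.array([1.0, 1.2, 2.1])          # class 2 feature means
sigma = np.array([1.0, 2.0, 0.5])        # shared per-feature standard deviations

contrib = ((mu1 - mu2) / sigma) ** 2     # each feature's term ((mu_i1 - mu_i2)/sigma_i)^2
r = math.sqrt(contrib.sum())
p_error = 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))   # same tail integral as before

print(contrib)    # adding an independent feature can only increase r^2 ...
print(p_error)    # ... and therefore can only decrease this (theoretical) error
```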
l Component Analysis and Discriminants
l Combine features in order to reduce the dimension of the feature space
l Linear combinations are simple to compute and tractable
l Project high dimensional data onto a lower dimensional space
l Two classical approaches for finding an “optimal” linear transformation:
l PCA (Principal Component Analysis): “the projection that best represents the data in a least-squares sense” (sketched below)
l MDA (Multiple Discriminant Analysis): “the projection that best separates the data in a least-squares sense”
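A minimal PCA sketch: project centered data onto the top-k eigenvectors of its covariance matrix (the 3-D toy data below are invented; this covers only the PCA half, not MDA):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 1.0, 0.1]), size=500)  # n x d toy data

def pca_project(X, k):
    """Project X onto the k principal components (best least-squares representation)."""
    Xc = X - X.mean(axis=0)                     # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]           # sort components by decreasing variance
    W = eigvecs[:, order[:k]]                   # d x k projection matrix
    return Xc @ W

Y = pca_project(X, k=2)                         # 3-D data projected onto a 2-D subspace
print(Y.shape)
```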