240-650 Principles of Pattern Recognition
Montri [email protected]
http://fivedots.coe.psu.ac.th/~montri

Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation
Introduction
• We could design an optimal classifier if we knew the prior probabilities P(ωi) and the class-conditional densities p(x|ωi)
• We rarely have such complete knowledge of the probabilistic structure of the problem
• Instead, we estimate P(ωi) and p(x|ωi) from training data (design samples)
Maximum-Likelihood Estimation
• ML estimation
• Has good convergence properties as the number of training samples increases
• Is often simpler than alternative methods
The General Principle
• Suppose we separate a collection of samples according to class, so that we have c data sets D1, …, Dc, with the samples in Dj drawn independently according to the probability law p(x|ωj)
• We say such samples are i.i.d. – independent and identically distributed random variables
• We assume that p(x|ωj) has a known parametric form and is determined uniquely by the value of a parameter vector θj
• For example, a Gaussian class-conditional density:

$$p(\mathbf{x}\,|\,\omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

• We explicitly write p(x|ωj) as p(x|ωj, θj) to show the dependence on θj
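As a quick illustration of such a parametric form, one can draw i.i.d. samples from a class-conditional Gaussian with NumPy; the parameter values below are invented for the example, not taken from the course.

```python
import numpy as np

# Hypothetical parameters theta_j = (mu_j, Sigma_j) for one class omega_j
mu_j = np.array([1.0, -2.0])
Sigma_j = np.array([[2.0, 0.3],
                    [0.3, 1.0]])

rng = np.random.default_rng(0)
# n = 500 i.i.d. samples drawn from p(x | omega_j) = N(mu_j, Sigma_j)
D_j = rng.multivariate_normal(mu_j, Sigma_j, size=500)

print(D_j.shape)  # one row per sample, one column per feature dimension
```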
Problem Statement
• Use the information provided by the training samples to obtain good estimates of the unknown parameter vectors θ1, …, θc associated with each category
Simplified Problem Statement
• If samples in Di give no information about θj for i ≠ j,
• then we have c separate problems of the following form:
To use a set D of training samples drawn independently from the probability density p(x|θ) to estimate the unknown parameter vector θ.
• Suppose that D contains n samples, x1, …, xn.
• Then, because the samples were drawn independently, we have

$$p(D\,|\,\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k\,|\,\boldsymbol{\theta})$$

• Viewed as a function of θ, p(D|θ) is called the likelihood of θ with respect to the set of samples
• The maximum-likelihood estimate $\hat{\boldsymbol{\theta}}$ of θ is the value of θ that maximizes p(D|θ)
• Let θ = (θ1, …, θp)^t
• Let ∇θ be the gradient operator

$$\nabla_{\boldsymbol{\theta}} = \begin{bmatrix} \dfrac{\partial}{\partial\theta_1} \\ \vdots \\ \dfrac{\partial}{\partial\theta_p} \end{bmatrix}$$
Log-Likelihood Function
• We define l(θ) as the log-likelihood function:

$$l(\boldsymbol{\theta}) = \ln p(D\,|\,\boldsymbol{\theta})$$

• We can write our solution as

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\; l(\boldsymbol{\theta})$$
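The argmax definition can be checked numerically. Below is a minimal sketch, assuming univariate Gaussian data with known unit variance, so that θ is just the unknown mean; the data and the search grid are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 200 draws from N(3, 1); the true mean 3.0 is the unknown theta
x = rng.normal(3.0, 1.0, size=200)

def log_likelihood(theta, x):
    # l(theta) = sum_k ln p(x_k | theta) for a N(theta, 1) density
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

# theta_hat = argmax_theta l(theta), found here by a coarse grid search
grid = np.linspace(0.0, 6.0, 601)
theta_hat = grid[np.argmax([log_likelihood(t, x) for t in grid])]
```

The grid maximizer agrees with the sample mean up to the grid resolution, which previews the closed-form result derived in the following slides.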
MLE
• From

$$p(D\,|\,\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k\,|\,\boldsymbol{\theta})$$

• we have

$$l(\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k\,|\,\boldsymbol{\theta})$$

• and

$$\nabla_{\boldsymbol{\theta}}\, l = \sum_{k=1}^{n} \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_k\,|\,\boldsymbol{\theta})$$

• A necessary condition for the MLE:

$$\nabla_{\boldsymbol{\theta}}\, l = \mathbf{0}$$
The Gaussian Case: Unknown μ
• Suppose that the samples are drawn from a multivariate normal population with mean μ and covariance matrix Σ
• Let μ be the only unknown
• Consider a sample point xk and find

$$\ln p(\mathbf{x}_k\,|\,\boldsymbol{\mu}) = -\frac{1}{2}\ln\!\left[(2\pi)^d |\boldsymbol{\Sigma}|\right] - \frac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^t\, \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu})$$

• and

$$\nabla_{\boldsymbol{\mu}} \ln p(\mathbf{x}_k\,|\,\boldsymbol{\mu}) = \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k - \boldsymbol{\mu})$$
• The MLE of μ must satisfy

$$\sum_{k=1}^{n} \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}) = \mathbf{0}$$

• After rearranging,

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$$
Sample Mean
• The MLE for the unknown population mean is just the arithmetic average of the training samples (the sample mean)
• If we think of the n samples as a cloud of points, the sample mean is the centroid of the cloud
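A minimal numerical sketch confirming that the sample mean satisfies the necessary condition above; the distribution parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
X = rng.multivariate_normal([0.0, 1.0], Sigma, size=100)

# Sample mean: the MLE of mu, and the centroid of the cloud of points
mu_hat = X.mean(axis=0)

# Check the necessary condition: sum_k Sigma^{-1} (x_k - mu_hat) = 0
Sigma_inv = np.linalg.inv(Sigma)
residual = Sigma_inv @ (X - mu_hat).sum(axis=0)  # should be ~0 up to rounding
```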
The Gaussian Case: Unknown μ and Σ
• This is the more typical case, in which both the mean and the covariance matrix are unknown
• Consider the univariate case with θ1 = μ and θ2 = σ²

$$\ln p(x_k\,|\,\boldsymbol{\theta}) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$
• And its derivative is

$$\nabla_{\boldsymbol{\theta}}\, l = \nabla_{\boldsymbol{\theta}} \ln p(x_k\,|\,\boldsymbol{\theta}) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} \end{bmatrix}$$

• Setting it to 0 gives

$$\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2}\,(x_k - \hat{\theta}_1) = 0$$

• and

$$-\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{\hat{\theta}_2^2} = 0$$
• With a little rearranging, we have

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^2$$
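These two estimates are easy to compute directly. A minimal sketch with synthetic data (the true parameters 5.0 and 2.0 are invented); note that NumPy's `np.var` with its default `ddof=0` computes exactly this ML estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=1000)  # synthetic univariate Gaussian data
n = len(x)

mu_hat = x.sum() / n                        # (1/n) sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).sum() / n  # (1/n) sum_k (x_k - mu_hat)^2
```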
MLE for the Multivariate Case

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^t$$
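The multivariate estimates can be sketched the same way on synthetic data (parameters invented); the sum of outer products is written as a single matrix product over the centered data.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0],
                            [[3.0, 1.0],
                             [1.0, 2.0]], size=2000)
n = X.shape[0]

mu_hat = X.mean(axis=0)          # (1/n) sum_k x_k
centered = X - mu_hat
# Sigma_hat = (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t
Sigma_hat = centered.T @ centered / n
```

This matches `np.cov(X.T, bias=True)`, NumPy's covariance routine with the 1/n normalization.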
Bias
• The MLE for the variance σ² is biased: the expected value over all data sets of size n of the sample variance is not equal to the true variance:

$$\mathcal{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

• An unbiased estimator for Σ is given by

$$\mathbf{C} = \frac{1}{n-1}\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^t$$
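The bias can be demonstrated by simulation: averaging the ML variance estimate over many small data sets approaches ((n-1)/n)σ², not σ². The sample size and variance below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
true_var = 4.0
n = 5           # a small n makes the bias clearly visible
trials = 20000  # many independent data sets of size n

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
mle_var = samples.var(axis=1, ddof=0)       # biased: divides by n
unbiased_var = samples.var(axis=1, ddof=1)  # unbiased: divides by n - 1

# Empirically, mle_var.mean() is close to ((n-1)/n) * true_var = 3.2,
# while unbiased_var.mean() is close to true_var = 4.0
```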