ECE 8443 – Pattern Recognition
LECTURE 10: MAXIMUM LIKELIHOOD ESTIMATION
• Objectives: Overview, General Case, Gaussian Cases
• Resources:
DHS – Chap. 3 (Part 1)
AM – Tutorial
AM – Links
BGIM – Primer
CSRN – Unbiased
DM – Bias
• URL: .../publications/courses/ece_8443/lectures/current/lecture_10.ppt
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and the class-conditional densities, p(x|ωi).
• What can we do if we do not have this information?
• What limitations do we face?
• There are two common approaches to parameter estimation: maximum-likelihood and Bayesian estimation.
• Maximum likelihood: treat the parameters as quantities whose values are fixed but unknown.
• Bayes: treat the parameters as random variables having some known prior distribution. Observations of samples convert this prior to a posterior density.
• Bayesian learning: sharpen the a posteriori density causing it to peak near the true value.
10: MAXIMUM LIKELIHOOD ESTIMATION – INTRODUCTION
10: MAXIMUM LIKELIHOOD ESTIMATION – GENERAL PRINCIPLE
• I.I.D.: c data sets, D1, ..., Dc, where Dj is drawn independently according to p(x|ωj).
• Assume p(x|ωj) has a known parametric form and is completely determined by the parameter vector θj (e.g., p(x|ωj) ~ N(μj, Σj), where θj = [μ1, ..., μd, σ11, σ12, ..., σdd]).
• p(x|ωj) has an explicit dependence on θj: p(x|ωj, θj).
• Use training samples to estimate θ1, θ2, ..., θc.
• Functional independence: assume Di gives no useful information about θj for i ≠ j.
• This simplifies notation to a single set D of training samples (x1, ..., xn) drawn independently from p(x|θ), used to estimate θ.
• Because the samples were drawn independently:

$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)$$
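The product above is usually evaluated in log form so that it becomes a sum. A minimal sketch (not from the slides; the Gaussian density and the sample data are assumed for illustration):

```python
import math

def gaussian_log_pdf(x, mu, sigma2):
    """Log-density of a univariate Gaussian N(mu, sigma2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

def log_likelihood(data, mu, sigma2):
    """ln p(D|theta) = sum_k ln p(x_k|theta): because the samples are
    drawn independently, the product of densities becomes a sum of logs."""
    return sum(gaussian_log_pdf(x, mu, sigma2) for x in data)

data = [1.2, 0.8, 1.5, 1.1]
# The (log-)likelihood of theta = (mu, sigma^2) with respect to this data:
print(log_likelihood(data, mu=1.0, sigma2=0.25))
```

Working in log space also avoids numerical underflow when n is large.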
10: MAXIMUM LIKELIHOOD ESTIMATION – EXAMPLE
• p(D|) is called the likelihood of with respect to the data.
• Given several training points
• Top: candidate source distributions are shown
• Which distribution is the ML estimate?
• Middle: an estimate of the likelihood of the data as a function of μ (the mean)
• Bottom: log likelihood
• The value of θ that maximizes this likelihood, denoted $\hat{\theta}$, is the maximum likelihood (ML) estimate of θ.
10: MAXIMUM LIKELIHOOD ESTIMATION – GENERAL MATHEMATICS
• Let $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^t$ denote the parameter vector.

• Let $\nabla_\theta = \left[\dfrac{\partial}{\partial\theta_1}, \ldots, \dfrac{\partial}{\partial\theta_p}\right]^t$ denote the gradient operator.

• Define the log-likelihood:

$$l(\theta) = \ln p(D|\theta) = \sum_{k=1}^{n} \ln p(x_k|\theta)$$

$$\hat{\theta} = \arg\max_{\theta}\, l(\theta)$$

• The ML estimate is found by solving this equation:

$$\nabla_\theta\, l = \sum_{k=1}^{n} \nabla_\theta \ln p(x_k|\theta) = 0$$
• The solution to this equation can be a global maximum, a local maximum, or even an inflection point.
• Under what conditions is it a global maximum?
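When the gradient equation cannot be solved in closed form, the maximum can be located numerically. A minimal sketch (assumed, not from the slides): maximize the Gaussian log-likelihood over a grid of candidate means, and check that the maximizer matches the sample mean, as the gradient condition predicts for a Gaussian.

```python
import math

def log_likelihood(data, mu, sigma2=1.0):
    """ln p(D|mu) for a univariate Gaussian with known variance sigma2."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2) for x in data)

data = [2.1, 1.9, 2.4, 2.0, 1.6]
candidates = [i / 1000.0 for i in range(0, 4001)]   # grid over [0, 4]
mu_hat = max(candidates, key=lambda mu: log_likelihood(data, mu))

# For a Gaussian the log-likelihood is concave in mu, so this grid
# maximum is the global maximum and coincides with the sample mean.
print(mu_hat, sum(data) / len(data))
```

Concavity of the log-likelihood is what guarantees the stationary point is a global maximum here; in general, a zero gradient can also be a local maximum or an inflection point.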
10: MAXIMUM LIKELIHOOD ESTIMATION – MAXIMUM A POSTERIORI (MAP)
• A class of estimators – maximum a posteriori (MAP) – maximize $l(\theta) + \ln p(\theta)$, where $p(\theta)$ describes the prior probability of different parameter values.
• An ML estimator is a MAP estimator for uniform priors.
• A MAP estimator finds the peak, or mode, of a posterior density.
• MAP estimators are not transformation invariant (if we perform a nonlinear transformation of the input data, the estimator is no longer optimum in the new space). This observation will be useful later in the course.
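A minimal sketch (assumed, not from the slides): the MAP estimate of a Gaussian mean under a Gaussian prior N(mu0, tau2), with known likelihood variance sigma2. Maximizing $l(\mu) + \ln p(\mu)$ in closed form yields a weighted average of the sample mean and the prior mean, and a nearly uniform (very broad) prior recovers the ML estimate.

```python
def map_mean(data, mu0, tau2, sigma2):
    """MAP estimate of mu: Gaussian likelihood N(mu, sigma2),
    Gaussian prior N(mu0, tau2) on mu (closed-form posterior mode)."""
    n = len(data)
    xbar = sum(data) / n
    return (n * tau2 * xbar + sigma2 * mu0) / (n * tau2 + sigma2)

data = [2.1, 1.9, 2.4, 2.0, 1.6]   # sample mean = 2.0
# A very broad prior (tau2 large): MAP approaches the ML estimate.
print(map_mean(data, mu0=0.0, tau2=1e6, sigma2=1.0))
# A tight prior pulls the estimate toward mu0 = 0.
print(map_mean(data, mu0=0.0, tau2=0.1, sigma2=1.0))
```

This illustrates the statement above: an ML estimator is a MAP estimator with a uniform prior.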
10: MAXIMUM LIKELIHOOD ESTIMATION – GAUSSIAN CASE: UNKNOWN MEAN
• Consider the case where only the mean, $\theta = \mu$, is unknown:

$$\ln p(x_k|\mu) = \ln\!\left[\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x_k-\mu)^t\Sigma^{-1}(x_k-\mu)\right)\right] = -\tfrac{1}{2}\ln\!\left[(2\pi)^d|\Sigma|\right] - \tfrac{1}{2}(x_k-\mu)^t\Sigma^{-1}(x_k-\mu)$$

• Setting the gradient of the log-likelihood to zero:

$$\nabla_\mu\, l = \sum_{k=1}^{n}\nabla_\mu \ln p(x_k|\mu) = 0$$

which implies:

$$\nabla_\mu \ln p(x_k|\mu) = \Sigma^{-1}(x_k-\mu)$$

because:

$$\nabla_\mu\!\left[-\tfrac{1}{2}\ln\!\left[(2\pi)^d|\Sigma|\right] - \tfrac{1}{2}(x_k-\mu)^t\Sigma^{-1}(x_k-\mu)\right] = \Sigma^{-1}(x_k-\mu)$$

• Rearranging terms:

$$\sum_{k=1}^{n}\Sigma^{-1}(x_k-\hat{\mu}) = 0$$

• Significance???
10: MAXIMUM LIKELIHOOD ESTIMATION – GAUSSIAN CASE: UNKNOWN MEAN
• Substituting into the expression for the total likelihood:

$$\nabla_\mu\, l = \sum_{k=1}^{n}\Sigma^{-1}(x_k-\hat{\mu}) = 0$$

• Multiplying through by $\Sigma$ and rearranging terms:

$$\sum_{k=1}^{n}(x_k-\hat{\mu}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}x_k - n\hat{\mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k$$

• The ML estimate of the mean is simply the sample mean of the training data.
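A quick numerical check (assumed, not from the slides): the gradient condition $\sum_k (x_k - \mu)/\sigma^2 = 0$ is solved exactly by the sample mean, since the positive factor $1/\sigma^2$ scales out.

```python
data = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat = sum(data) / len(data)          # sample mean = ML estimate

# Gradient of the log-likelihood at mu_hat (up to the 1/sigma^2 factor):
gradient = sum(x - mu_hat for x in data)
print(gradient)  # zero up to floating-point rounding
```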
10: MAXIMUM LIKELIHOOD ESTIMATION – UNKNOWN MEAN AND VARIANCE
• Let $\theta = [\mu, \sigma^2]^t$. For the univariate Gaussian:

$$\ln p(x_k|\theta) = -\frac{1}{2}\ln\!\left[2\pi\sigma^2\right] - \frac{(x_k-\mu)^2}{2\sigma^2}$$

$$\nabla_\theta \ln p(x_k|\theta) = \begin{bmatrix} \dfrac{x_k-\mu}{\sigma^2} \\[2mm] -\dfrac{1}{2\sigma^2} + \dfrac{(x_k-\mu)^2}{2\sigma^4} \end{bmatrix}$$

• The full likelihood leads to:

$$\sum_{k=1}^{n}\frac{x_k-\hat{\mu}}{\hat{\sigma}^2} = 0 \qquad\text{and}\qquad -\sum_{k=1}^{n}\frac{1}{2\hat{\sigma}^2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\mu})^2}{2\hat{\sigma}^4} = 0$$
10: MAXIMUM LIKELIHOOD ESTIMATION – UNKNOWN MEAN AND VARIANCE
• This leads to these equations:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \qquad\qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$$

• In the multivariate case:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \qquad\qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})(x_k-\hat{\mu})^t$$

• The true covariance is the expected value of the matrix $(x_k-\hat{\mu})(x_k-\hat{\mu})^t$, which is a familiar result.
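The multivariate estimates above can be computed directly. A minimal sketch (assumed, not from the slides); note the 1/n factor on the covariance (the biased ML form), not the 1/(n-1) sample covariance:

```python
import numpy as np

def ml_estimates(X):
    """X is an (n, d) array of n i.i.d. samples; returns (mu_hat, Sigma_hat),
    the ML estimates of the mean vector and covariance matrix."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = (centered.T @ centered) / n   # sum of outer products / n
    return mu_hat, Sigma_hat

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
mu_hat, Sigma_hat = ml_estimates(X)
# Matches numpy's biased estimator, np.cov(X.T, bias=True)
print(mu_hat)
print(Sigma_hat)
```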
10: MAXIMUM LIKELIHOOD ESTIMATION – CONVERGENCE OF THE MEAN
• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let’s start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
$$E[\hat{\mu}] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu$$

• Variance of the ML estimate of the mean:

$$\mathrm{var}[\hat{\mu}] = E[\hat{\mu}^2] - \left(E[\hat{\mu}]\right)^2$$

$$E[\hat{\mu}^2] = E\!\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)\!\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)\right] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_i x_j]$$
10: MAXIMUM LIKELIHOOD ESTIMATION – VARIANCE OF ML ESTIMATE OF THE MEAN
• The expected value of $x_i x_j$ will be $\mu^2$ for $i \ne j$, since the two random variables are independent.
• The expected value of $x_i^2$ will be $\mu^2 + \sigma^2$.
• Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$ and $n$ terms with expected value $\mu^2 + \sigma^2$.
• Thus,

$$\mathrm{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n\left(\mu^2+\sigma^2\right)\right] - \mu^2 = \frac{\sigma^2}{n}$$
• We see that the variance of the estimate goes to zero as n goes to infinity, so our estimate converges to the true mean (the error goes to zero).
which implies:

$$E[\hat{\mu}^2] = \mathrm{var}[\hat{\mu}] + \left(E[\hat{\mu}]\right)^2 = \frac{\sigma^2}{n} + \mu^2$$

• Similarly, for a single sample: $E[x_i^2] = \mathrm{var}[x_i] + \left(E[x_i]\right)^2 = \sigma^2 + \mu^2$.

• Note that this implies:

$$E\!\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)^{\!2}\right] = \frac{\sigma^2}{n} + \mu^2$$
• Now we can combine these results. Recall our expression for the ML estimate of the variance:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2\right]$$
10: MAXIMUM LIKELIHOOD ESTIMATION – VARIANCE RELATIONSHIPS
• We will need one more result:
10: MAXIMUM LIKELIHOOD ESTIMATION – COVARIANCE EXPANSION
$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{1}{n}\sum_{i=1}^{n}E\!\left[x_i^2 - 2x_i\hat{\mu} + \hat{\mu}^2\right]$$

$$= \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right) = \frac{1}{n}\sum_{i=1}^{n}\left(\left(\sigma^2+\mu^2\right) - 2E[x_i\hat{\mu}] + \left(\frac{\sigma^2}{n}+\mu^2\right)\right)$$
• Expand the covariance and simplify:

$$E[x_i\hat{\mu}] = E\!\left[x_i\cdot\frac{1}{n}\sum_{j=1}^{n}x_j\right] = \frac{1}{n}\sum_{j=1}^{n}E[x_i x_j] = \frac{1}{n}\left(E[x_i^2] + \sum_{j\ne i}E[x_i x_j]\right)$$

$$= \frac{1}{n}\left(\left(\sigma^2+\mu^2\right) + (n-1)\mu^2\right) = \frac{\sigma^2}{n} + \mu^2$$
• One more intermediate term to derive:
10: MAXIMUM LIKELIHOOD ESTIMATION – BIASED VARIANCE ESTIMATE
• Substitute our previously derived expression for the second term:

$$E[\hat{\sigma}^2] = \frac{1}{n}\sum_{i=1}^{n}\left(\left(\sigma^2+\mu^2\right) - 2\left(\frac{\sigma^2}{n}+\mu^2\right) + \left(\frac{\sigma^2}{n}+\mu^2\right)\right)$$

$$= \frac{1}{n}\sum_{i=1}^{n}\left(\sigma^2 - \frac{\sigma^2}{n}\right) = \frac{(n-1)}{n}\,\sigma^2$$
• An unbiased estimator is:

$$C = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\hat{\mu})(x_i-\hat{\mu})^t$$

• These are related by:

$$\hat{\Sigma} = \frac{(n-1)}{n}\,C$$

which is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
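The biased/unbiased relationship can be verified numerically. A minimal sketch (assumed, not from the slides), using numpy's `ddof` parameter to select the divisor (the data values are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

sigma2_ml = np.var(x, ddof=0)        # ML (biased) estimate: divides by n
sigma2_unbiased = np.var(x, ddof=1)  # unbiased estimate: divides by n - 1

# Sigma_hat = ((n - 1) / n) * C, the relation stated above
print(sigma2_ml, sigma2_unbiased, (n - 1) / n * sigma2_unbiased)
```

As n grows, the factor (n-1)/n approaches 1, which is why the ML estimate is asymptotically unbiased.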
10: MAXIMUM LIKELIHOOD ESTIMATION – EXPECTATION SIMPLIFICATION
• Therefore, the ML estimate of the variance is biased:

$$E[\hat{\sigma}^2] = \frac{(n-1)}{n}\,\sigma^2 \ne \sigma^2$$

However, the bias vanishes as $n \to \infty$: the ML estimate is asymptotically unbiased and converges in the mean-square sense.