SDS 321: Introduction to Probability and Statistics
Lecture 24: Maximum Likelihood Estimation

Purnamrita Sarkar
Department of Statistics and Data Science
The University of Texas at Austin

www.cs.cmu.edu/∼psarkar/teaching

1


Roadmap

- Introduction to frequentist statistics
- Biased, unbiased, and asymptotically unbiased estimators
- The maximum likelihood estimator (MLE) and how to find it

2


Frequentist Statistics

- The parameter θ is fixed and unknown.
- The data are generated through the likelihood function p(X; θ) (if discrete) or f(X; θ) (if continuous).
- We will now be dealing with multiple candidate models, one for each value of θ.
- We will write E_θ[h(X)] for the expectation of the random variable h(X), viewed as a function of the parameter θ.

3


Problems we will look at

- Parameter estimation: We want to estimate unknown parameters from data.
  - Maximum likelihood estimation (Section 9.1): Select the parameter that makes the observed data most likely,
    i.e. maximize the probability of obtaining the data at hand.
- Hypothesis testing: An unknown parameter takes a finite number of values. We want to find the best hypothesis based on the data.
- Significance testing: Given a hypothesis, find the rejection region and reject the hypothesis if the observation falls within this region.

4


Classical parameter estimation

We are given observations X = (X1, …, Xn). An estimator is a random variable of the form Θ̂ = g(X) (sometimes also denoted by Θ̂n).

- Since the distribution of X depends on θ, so does the distribution of Θ̂n.
- The mean and variance of Θ̂n are E_θ[Θ̂n] and var_θ(Θ̂n).
- For simplicity we will also write E[·] and var[·], dropping θ from the notation.
- The estimation error is Θ̃n = Θ̂n − θ.
- The bias of an estimator is b_θ(Θ̂n) = E_θ[Θ̂n] − θ.
- An unbiased estimator is one for which E[Θ̂n] = θ.
- An asymptotically unbiased estimator is one for which lim_{n→∞} E[Θ̂n] = θ.

5


Maximum Likelihood Estimation

- Find the θ that maximizes the joint likelihood p(X1, …, Xn; θ) (or the joint PDF, for continuous random variables).
- We have P(Xi; θ) for each random variable Xi.
- Often the Xi are independent, so P(X1, …, Xn; θ) = ∏_i P(Xi; θ).
- We want to compute the maximum likelihood estimate θ̂:
  - First compute P(X1, …, Xn; θ) = ∏_i P(Xi; θ).
  - Then take the logarithm: log P(X1, …, Xn; θ) = ∑_i log P(Xi; θ).
  - Finally take the derivative and set it to zero, d/dθ ∑_i log P(Xi; θ) = 0, and solve for θ.

6


Likelihood vs. log-likelihood

[Figure: (A) the likelihood and (B) the log-likelihood, plotted against the parameter]

Takeaway:
- The log-likelihood is a monotonic function of the likelihood.
- So the maximum is achieved at the same point, albeit in most cases with a lot less computation.

7
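This takeaway is easy to verify numerically. The sketch below (all values made up for illustration; exponential samples are used only as an example density) evaluates both curves on a parameter grid and checks that they peak at the same point:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=50)  # i.i.d. sample, true rate 0.5

# Evaluate both curves over a grid of candidate rates lambda.
lam_grid = np.linspace(0.05, 2.0, 1000)
lik = np.array([np.prod(lam * np.exp(-lam * x)) for lam in lam_grid])
loglik = np.array([len(x) * np.log(lam) - lam * x.sum() for lam in lam_grid])

# log is monotonic, so the two curves attain their maximum at the same grid point.
assert np.argmax(lik) == np.argmax(loglik)
```

In practice the log-likelihood is also numerically safer: products of many small densities underflow long before the corresponding sums of logs do.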


Estimating the parameter of the Exponential

n customers arrive at a mall at times Yi, with Y0 = 0. The interarrival times Xi = Yi − Yi−1 are often modeled as i.i.d. Exponential(λ) random variables. We want the MLE of λ.

- First write f_{Xi}(xi; λ) = λ e^{−λ xi}.
- Now write the joint likelihood f_X(x; λ) = ∏_i λ e^{−λ xi} = λ^n ∏_i e^{−λ xi}.
- Now take the logarithm of the joint likelihood:
  log f_X(x; λ) = n log λ − λ ∑_i xi.
- Differentiate and set to zero to get the MLE:
  n/λ − ∑_i xi = 0 → λ̂ = n / ∑_i xi = 1 / x̄.

8
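The closed form λ̂ = 1/x̄ is easy to sanity-check on simulated interarrival times; the rate and sample size below are made-up values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
true_rate = 1.5                                          # hypothetical lambda
x = rng.exponential(scale=1.0 / true_rate, size=10_000)  # simulated interarrival times

lam_hat = 1.0 / x.mean()  # MLE: n / sum(x_i) = 1 / sample mean
```

With this many observations, lam_hat lands close to the rate used to simulate the data.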


Maximum Likelihood Estimation

Let's start with an example.

- You have n i.i.d. Bernoulli random variables Xi ∼ Bernoulli(p).
- Find the maximum likelihood estimate of p.
- First write the likelihood of the i-th r.v. conveniently as P(Xi; p) = p^{Xi} (1 − p)^{1−Xi}.
- Now note that the r.v.'s are all independent, so
  P(X1, …, Xn; p) = ∏_{i=1}^{n} p^{Xi} (1 − p)^{1−Xi}
- It is often more convenient to maximize the logarithm of a product:
  log P(X1, …, Xn; p) = ∑_i log P(Xi; p)
                      = ∑_i log( p^{Xi} (1 − p)^{1−Xi} )
                      = ∑_i Xi log p + ∑_i (1 − Xi) log(1 − p)
- p̂ = arg max_p log P(X1, …, Xn; p):
  ∑_i Xi / p − (n − ∑_i Xi) / (1 − p) = 0 → p̂ = ∑_i Xi / n

9
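The derivation can be sketched in a few lines; p_true and the sample size are made-up values. The second check confirms that the derivative of the log-likelihood (the score) vanishes at p̂:

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, n = 0.3, 10_000              # hypothetical parameter and sample size
x = rng.binomial(1, p_true, size=n)  # n i.i.d. Bernoulli(p) draws

p_hat = x.mean()                     # MLE: sum(X_i) / n

# The score from the slide, evaluated at p_hat, is zero up to rounding.
s = x.sum()
score = s / p_hat - (n - s) / (1 - p_hat)
```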


MLE of the mean of Gaussian r.v.'s

I have n i.i.d. random variables from a N(µ, σ²) distribution. I know σ² but not µ. What is the MLE of µ?

- Notation: X = (X1, …, Xn) and x = (x1, …, xn).
- First write f_{Xi}(xi; µ, σ) = (1/√(2πσ²)) e^{−(xi − µ)² / (2σ²)}.
- Now write the joint PDF f_X(x; µ, σ) = ∏_i f_{Xi}(xi; µ, σ).
- Now take the logarithm of the joint likelihood:
  log f_X(x; µ, σ) = −n log(√(2πσ²)) − ∑_i (xi − µ)² / (2σ²)
- Differentiate w.r.t. µ and set to zero: ∑_i (xi − µ) / σ² = 0.
- Solve for µ: ∑_i xi − nµ = 0 → µ̂ = ∑_i xi / n.
- The MLE is just the sample mean, which is often denoted by X̄.

10
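A minimal simulation of this result; mu_true and sigma are made-up values. Note that σ drops out of the first-order condition, so the estimate does not use it:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma = 5.0, 2.0                    # hypothetical parameters
x = rng.normal(mu_true, sigma, size=10_000)

mu_hat = x.mean()  # the MLE of mu is the sample mean, regardless of sigma
```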


MLE of the mean and variance of Gaussian r.v.'s

I have n i.i.d. random variables from a N(µ, σ²) distribution. I do not know σ² or µ. What are the MLEs of µ and σ²?

- We start by taking the logarithm of the joint likelihood:
  log f_X(x; µ, σ) = −n log(√(2πσ²)) − ∑_i (xi − µ)² / (2σ²) ∗
- First differentiate w.r.t. µ and set to zero: ∑_i (xi − µ) / σ² = 0.
- Next differentiate w.r.t. σ and set to zero:
  −n/σ + ∑_i (xi − µ)² / σ³ = 0

∗ Thanks to Aneesh Adhikari-Desai for correcting this!

11


MLE of the mean and variance of Gaussian r.v.'s

I have n i.i.d. random variables from a N(µ, σ²) distribution. I do not know σ² or µ. What are the MLEs of µ and σ²?

- Now you have two equations and two unknowns!†
  ∑_i (xi − µ) / σ² = 0
  −n/σ + ∑_i (xi − µ)² / σ³ = 0
- Solving, we see:
  µ̂ = ∑_i xi / n = x̄        σ̂² = ∑_i (xi − x̄)² / n

† Thanks to Aneesh Adhikari-Desai for correcting this!

12
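Both closed forms can be checked in a few lines; the parameters below are made up. Note that the σ̂² formula divides by n, which matches numpy's default normalization (ddof=0) in np.var:

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, var_true = 1.0, 4.0                             # hypothetical parameters
x = rng.normal(mu_true, np.sqrt(var_true), size=20_000)

mu_hat = x.mean()                     # MLE of mu
var_hat = ((x - mu_hat) ** 2).mean()  # MLE of sigma^2: divide by n, not n - 1
```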


Estimating the lower limit of the Uniform

I have n independent Uniform([a, 1]) random variables. What is the MLE of a? Although taking the log usually helps, sometimes it is easier to work with the joint likelihood directly.

- First write f(Xi; a) = 1{a ≤ Xi ≤ 1} / (1 − a).
- Now write
  f(X1, X2, …, Xn; a) = (1/(1 − a)^n) ∏_{i=1}^{n} 1{a ≤ Xi ≤ 1}
                      = 1{a ≤ min(X1, X2, …) ≤ max(X1, X2, …) ≤ 1} / (1 − a)^n
- How do I maximize this? Well, a has to be less than min(X1, …, Xn); otherwise the likelihood is zero.
- Over all a ≤ min(X1, …, Xn), what maximizes 1/(1 − a)^n?
- The largest such a, since the larger a is, the larger 1/(1 − a)^n is.
- So â = min(X1, …, Xn)!

‡ Thanks to Miles Hutson for correcting this!

13
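Since the maximizer here is an order statistic rather than a zero of a derivative, the numerical sketch is just a minimum; a_true is a made-up value:

```python
import numpy as np

rng = np.random.default_rng(5)
a_true = 0.25                             # hypothetical lower limit
x = rng.uniform(a_true, 1.0, size=5_000)

a_hat = x.min()  # MLE: the sample minimum; it can never undershoot the true a
```

Because every observation is at least a, the estimate sits slightly above the true value, but the gap shrinks quickly as n grows.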


Unbiased and asymptotically unbiased estimators

Recall that the MLE of σ², obtained using n i.i.d. random variables from a N(µ, σ²) distribution, is given by:

σ̂² = ∑_i (xi − x̄)² / n     (or σ̂² = ∑_i (xi − µ)² / n when µ is known)

This is also the sample variance. Is it unbiased?

- First note that
  ∑_i (xi − x̄)² / n = ∑_i (xi² − 2 xi x̄ + x̄²) / n
                     = ∑_i xi²/n − 2x̄² + x̄²
                     = ∑_i xi²/n − x̄²

14


Unbiased and asymptotically unbiased estimators

- E[ ∑_i (xi − x̄)² / n ] = E[X1²] − E[X̄²]
- E[X1²] = σ² + µ², and E[X̄²] = var(X̄) + E[X̄]² = σ²/n + µ².
- And so, E[σ̂²] = σ² − σ²/n = σ²(1 − 1/n).
- So the MLE of the variance is not unbiased. However, it is asymptotically unbiased!
- You can also use another estimator, ∑_i (xi − x̄)² / (n − 1), which is not the MLE, but it is unbiased.

15
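A short simulation makes the bias visible. With n = 5, the divide-by-n estimator should average to σ²(1 − 1/n) = 0.8·σ², while the divide-by-(n − 1) version averages to σ². The sample sizes and σ² are made-up values:

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma2, reps = 5, 1.0, 200_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

xbar = samples.mean(axis=1, keepdims=True)
mle_var = ((samples - xbar) ** 2).mean(axis=1)  # divide by n (the MLE)
unbiased_var = samples.var(axis=1, ddof=1)      # divide by n - 1 (unbiased)

# Averaging over many repetitions estimates the expectations:
# mle_var.mean() is near sigma2 * (1 - 1/n); unbiased_var.mean() is near sigma2.
```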
