Maximum Likelihood Estimation
Psych 818 - DeShon
MLE vs. OLS

Ordinary Least Squares Estimation
- Typically yields a closed-form solution that can be directly computed
- Closed-form solutions often require very strong assumptions

Maximum Likelihood Estimation
- Default method for most estimation problems
- Generally equal to OLS when OLS assumptions are met
- Yields desirable "asymptotic" estimation properties
- Foundation for Bayesian inference
- Requires numerical methods :(
MLE logic

- MLE reverses the probability inference
- Recall: p(X|θ), where θ represents the parameters of a model (i.e., a pdf)
  - e.g., what's the probability of observing a score of 73 from a N(70,10) distribution? (see the one-line check below)
- In MLE, you know the data (X_i)
- Primary question: which of a potentially infinite number of distributions is most likely responsible for generating the data? That is, p(θ|X)?
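That forward probability from the "recall" bullet can be checked in one line of R; this check is an addition to the slides, and it assumes the 10 in N(70,10) is the standard deviation, as in the rnorm() calls used later.

    # density of a score of 73 under N(mean = 70, sd = 10)
    dnorm(73, mean = 70, sd = 10)
    # [1] 0.03813878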
Likelihood

- Likelihood may be thought of as an unbounded or unnormalized probability measure
- A pdf is a function of the data given the parameters, on the data scale
- Likelihood is a function of the parameters given the data, on the parameter scale (illustrated below)
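To make the "function of the parameters given the data" reading concrete, here is a small illustrative sketch (the observed score and the candidate means are arbitrary choices, not from the slides): holding one datum fixed and varying the mean traces out its likelihood.

    x.obs <- 73   # a single observed score, held fixed
    # density of the same datum under different candidate means (sd fixed at 10)
    sapply(c(60, 65, 70, 73, 80), function(mu) dnorm(x.obs, mean = mu, sd = 10))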
Likelihood

Likelihood function
- Likelihood is the joint (product) probability of the observed data given the parameters of the pdf
- Assume you have X_1, ..., X_n independent samples from a given pdf, f:

$$L(\theta) = f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Likelihood

Log-likelihood function
- Working with products is a pain
- Maxima are unaffected by monotone transformations, so we can take the logarithm of the likelihood and turn the product into a sum (see the underflow demonstration below):

$$l(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$
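The log is also a numerical necessity, not just a convenience: multiplying many densities underflows double-precision floating point, while summing their logs stays well behaved. A minimal demonstration (this example is an addition to the slides):

    set.seed(1)
    z <- rnorm(1000, mean = 0, sd = 1)

    # raw product of 1000 densities underflows to 0
    prod(dnorm(z, mean = 0, sd = 1))

    # sum of log densities is a perfectly ordinary number
    sum(dnorm(z, mean = 0, sd = 1, log = TRUE))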
Maximum Likelihood

- Find the value(s) of θ that maximize the likelihood function:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\ L(\theta)$$

- Can sometimes be found analytically
  - Maximization (or minimization) is the focus of calculus and derivatives of functions
- Often requires iterative numerical methods
Likelihood

Normal distribution example

- pdf:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

- Likelihood:

$$L(\mu, \sigma) = C\,\sigma^{-n} \exp\left(-\sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

- Log-likelihood:

$$l(\mu, \sigma) = \log(C) - n\log(\sigma) - \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}$$

- Note: C is a constant that vanishes once derivatives are taken
Likelihood

- Can compute the maximum of this log-likelihood function directly (worked below)
- More relevant (and fun) to estimate it numerically!
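For the record, the direct route for μ (with σ treated as fixed) is a one-line derivative; this worked step is an addition to the slides:

$$\frac{\partial l}{\partial \mu} = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$$

That is, the MLE of a normal mean is simply the sample mean.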
Normal distribution example

- Assume you obtain 100 samples from a normal distribution:

    rv.norm <- rnorm(100, mean=5, sd=2)

- This is the true data-generating model!
- Now, assume you don't know the mean of this distribution and we have to estimate it...
- Let's compute the log-likelihood of the observations for N(4,2)
Normal distribution example

    sum(dnorm(rv.norm, mean=4, sd=2, log=T))

- dnorm gives the probability density of an observation for a given distribution
- Summing it across observations gives the log-likelihood: -221.0698
- This is the log-likelihood of the data for the given pdf parameters
- Okay, this is the log-likelihood for one possible distribution... we need to examine it for all possible distributions and select the one that yields the largest value
Normal distribution example

- Make a sequence of possible means:

    m <- seq(from = 1, to = 10, by = 0.1)

- Now, compute the log-likelihood for each of the possible means
- This is a simple "grid search" algorithm:

    log.l <- sapply(m, function(x) sum(dnorm(rv.norm, mean=x, sd=2, log=T)))
Excerpt of the 91-row grid (the full slide listed every candidate mean from 1.0 to 10.0):

        mean     log.l
    1    1.0 -417.3891
    2    1.1 -407.2201
    3    1.2 -397.3012
    ...
    31   4.0 -221.0698
    ...
    41   5.0 -205.6301
    42   5.1 -205.4611
    43   5.2 -205.5421
    ...
    91  10.0 -503.4312

The log-likelihood rises to a peak near a mean of 5.1 and falls off on either side.
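The slides presumably plotted this curve at this point; a one-line sketch that reproduces it from the objects above:

    plot(m, log.l, type = "l", xlab = "candidate mean", ylab = "log-likelihood")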
Normal distribution example

- Why are these numbers negative?
Normal distribution example

- dnorm gives us the probability density of an observation from the given distribution
- The log of a value between 0 and 1 is negative
  - log(.05) = -2.99
- What's the MLE?

    m[which(log.l == max(log.l))]
    = 5.1
Normal distribution example
- What about estimating both the mean and the SD simultaneously?
- Use the grid search approach again... compute the log-likelihood at each combination of mean and SD (a code sketch for the two-parameter grid follows the excerpt)

    Excerpt of the full SD x mean grid:

           SD Mean      log.l
    1     1.0  1.0 -1061.6201
    2     1.0  1.1 -1022.2843
    ...
    859   1.9  4.9  -205.0078
    860   1.9  5.0  -204.9148
    861   1.9  5.1  -205.0988
    ...
    6138  7.7  5.0  -298.9893
    ...
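The slides don't show the code that builds this two-parameter grid; one way to do it, sketched here with my own object names (grid, plus mapply for the per-row sums), is:

    # all combinations of candidate SDs and means
    grid <- expand.grid(sd = seq(1, 10, by = 0.1), mean = seq(1, 10, by = 0.1))

    # log-likelihood at each (SD, mean) combination
    grid$log.l <- mapply(function(s, m) sum(dnorm(rv.norm, mean = m, sd = s, log = TRUE)),
                         grid$sd, grid$mean)

    # the combination with the largest log-likelihood
    grid[which.max(grid$log.l), ]

This should recover the same (SD = 1.9, mean = 5.0) pair the slides report.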
Normal distribution example

- Get max(log.l):

    m[which(log.l == max(log.l), arr.ind=T)]
    = 5.0, 1.9
- Note: this could be done the same way for a simple linear regression (2 parameters)
Algorithms

- Grid search works for these simple problems with few estimated parameters
- Much more advanced search algorithms are needed for more complex problems
- More advanced algorithms take advantage of the slope or gradient of the likelihood surface to make good guesses about the direction of search in parameter space
- We'll use the "mle" routine in R (from the stats4 package)
Algorithms

- Grid search: vary each parameter in turn, compute the log-likelihood, then find the parameter combination yielding the highest log-likelihood
- Gradient search: vary all parameters simultaneously, adjusting relative magnitudes of the variations so that the direction of propagation in parameter space is along the direction of steepest ascent of the log-likelihood
- Expansion methods: find an approximate analytical function that describes the log-likelihood hypersurface and use this function to locate the maximum. The number of computed points is smaller, but the computations are considerably more complicated.
- Marquardt method: a gradient-expansion combination
R – mle routine

- First we need to define a function to maximize
- Wait! Most general routines focus on minimization
  - e.g., root finding for solving equations
- So, we usually minimize the negative log-likelihood:

    norm.func <- function(x, y) {
      sum(sapply(rv.norm, function(z) -1 * dnorm(z, mean = x, sd = y, log = T)))
    }
R – mle routine

    library(stats4)   # mle() lives in the stats4 package
    norm.mle <- mle(norm.func, start = list(x = 4, y = 2),
                    method = "L-BFGS-B", lower = c(0, 0))

- Many interesting points:
  - Starting values
  - Global vs. local maxima or minima
  - Bounds
    - SD can't be negative
R – mle routine

- Output – summary(norm.mle)
- Standard errors come from the inverse of the Hessian matrix
- Convergence!!
- -2(log-likelihood) = deviance
  - Functions like the R² in regression

    Coefficients:
      Estimate Std. Error
    x 4.844249  0.1817031
    y 1.817031  0.1284834

    -2 log L: 403.2285

    > norm.mle@details$convergence
    [1] 0
Maximum Likelihood Regression
- A standard regression:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

- May be broken down into two components:

$$\hat{Y} = \beta_0 + \beta_1 X$$
$$Y \sim N(\hat{Y}, \sigma)$$
Maximum Likelihood Regression

- First define our x's and y's:

    x <- 1:100
    y <- 4 + 3*x + rnorm(100, mean = 5, sd = 20)

- Define the -log-likelihood function:

    reg.func <- function(b0, b1, sigma) {
      if (sigma <= 0) return(NA)   # no sd of 0 or less!
      yhat <- b0*x + b1            # the estimated function (note: b0 is the slope, b1 the intercept here)
      -sum(dnorm(y, mean = yhat, sd = sigma, log = T))   # the -log-likelihood
    }
Maximum Likelihood Regression

- Call mle to minimize the -log-likelihood:

    lm.mle <- mle(reg.func, start = list(b0 = 2, b1 = 2, sigma = 35))

- Get results – summary(lm.mle):

    Coefficients:
            Estimate Std. Error
    b0      3.071449  0.0716271
    b1      8.959386  4.1663956
    sigma  20.675930  1.4621709

    -2 log L: 889.567
Maximum Likelihood Regression

- Compare to OLS results – lm(y ~ x):

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8.95635    4.20838   2.128   0.0358 *
    x            3.07149    0.07235  42.454   <2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 20.88 on 98 degrees of freedom
    Multiple R-squared: 0.9484

- The estimates agree (remember that in reg.func, b0 is the slope and b1 the intercept). The small gap between the MLE sigma (20.68) and the residual standard error (20.88) arises because the MLE divides the residual sum of squares by n rather than n - 2.
Standard Errors of Estimates

- The behavior of the likelihood function near the maximum is important
- If it is flat, then the observations have little to say about the parameters
  - Changes in the parameters will not cause large changes in the probability
- If the likelihood has a pronounced peak near the maximum, then small changes in the parameters cause large changes in probability
  - In this case we say that the observations have more information about the parameters
- Expressed as the second derivative (or curvature) of the log-likelihood function
- If more than 1 parameter, then 2nd partial derivatives
Standard Errors of Estimates

- Rate of change is the second derivative of a function (e.g., velocity and acceleration)
- The Hessian matrix is the matrix of 2nd partial derivatives of the -log-likelihood function
- The entries in the Hessian are called the observed information for an estimate
Standard Errors

- The information is used to obtain the expected variance (or standard error) of the estimated parameters
- When the sample size becomes large, the maximum likelihood estimator becomes approximately normally distributed, with variance close to $I(\theta)^{-1}$
- More precisely (see the sketch below for how this plays out in R):

$$\operatorname{var}(\hat{\theta}) \approx I(\hat{\theta})^{-1}$$
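To tie this back to the earlier fit: stats4's mle() stores the optimizer's Hessian, and its inverse is the variance-covariance matrix that summary() uses. A small sketch, assuming norm.mle is the fit from the earlier slides:

    # Hessian of the -log-likelihood at the optimum
    H <- norm.mle@details$hessian

    # its inverse is the variance-covariance matrix (should match vcov(norm.mle))
    solve(H)

    # square roots of the diagonal reproduce the standard errors in summary(norm.mle)
    sqrt(diag(solve(H)))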
Likelihood Ratio Test

- Let L_F be the maximum of the likelihood function for an unrestricted model
- Let L_R be the maximum of the likelihood function of a restricted model nested in the full model
- L_F must be greater than or equal to L_R
  - Removing a variable or adding a constraint can only hurt model fit. Same logic as R²
- Question: does adding the constraint or removing the variable (a constraint of zero) significantly impact model fit?
  - Model fit will decrease, but does it decrease more than would be expected by chance?
Likelihood Ratio Test

- Likelihood ratio:
  - R = -2 ln(L_R / L_F)
  - R = 2(log(L_F) - log(L_R))
- R is distributed as a chi-square with m degrees of freedom
  - m is the difference in the number of estimated parameters between the two models
- The expected value of R is m, so if you get an R that is bigger than the difference in parameters, then the constraint hurts model fit
- More formally... you should reference the chi-square table with m degrees of freedom to find the probability of getting R by chance alone, assuming that the null hypothesis is true
Likelihood Ratio Example

- Go back to our simple regression example
- Does the variable (X) significantly improve our predictive ability or model fit?
  - Alternatively, does removing X or constraining its parameter estimate to zero significantly decrease prediction or model fit?
- Full model: -2 log L = 889.567
- Reduced model: -2 log L = 1186.05
- Chi-square critical value (1 df, alpha = .05) = 3.84
- R = 1186.05 - 889.567 = 296.48, far beyond 3.84, so X significantly improves model fit (the test is carried out in code below)
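One way to run this test in R, reusing reg.func and lm.mle from the regression slides; the reduced-model wrapper red.func and the object red.mle are my own names, and the sketch follows the slides' convention that b0 is the slope:

    # reduced model: slope (b0 in reg.func's naming) constrained to zero
    red.func <- function(b1, sigma) reg.func(b0 = 0, b1 = b1, sigma = sigma)
    red.mle  <- mle(red.func, start = list(b1 = 2, sigma = 35))

    # likelihood ratio statistic: difference in deviances (-2 log L)
    R <- -2 * as.numeric(logLik(red.mle)) - (-2 * as.numeric(logLik(lm.mle)))

    # compare against the chi-square critical value with 1 df
    R > qchisq(0.95, df = 1)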
Fit Indices

- Akaike's information criterion (AIC)
  - Pronounced "Ah-kah-ee-key"

$$AIC = -2\log L(\hat{\theta}) + 2k$$

- k is the number of estimated parameters in our model
- Penalizes the log-likelihood for using many parameters to increase fit
- Choose the model with the smallest AIC value
Fit Indices

- Bayesian information criterion (BIC)
  - AKA SIC, for Schwarz information criterion

$$BIC = -2\log L(\hat{\theta}) + k\log(n)$$

- Choose the model with the smallest BIC
  - The likelihood is the probability of obtaining the data you did under the given model. It makes sense to choose a model that makes this probability as large as possible. Putting the minus sign in front switches the maximization to minimization. (Both criteria are computed for the earlier regression fit below.)
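Both criteria can be computed from the earlier mle fits. stats4 provides a logLik method for mle objects, so the generic AIC works directly; the BIC line below assembles the formula by hand (k = 3 parameters, n = 100 observations for the regression fit):

    AIC(lm.mle)   # -2 log L + 2k = 889.567 + 2*3 = 895.567

    k <- 3; n <- 100
    -2 * as.numeric(logLik(lm.mle)) + k * log(n)   # BIC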
Multiple Regression

- The -log-likelihood function for multiple regression (a usage sketch follows):

    # Note: theta is a vector of parameters, with the error variance first
    #   (the code takes sqrt(theta[1]) as the sd)
    # theta[-1] is all values of theta except the first (the regression weights)
    # and here we're using matrix multiplication
    ols.lf3 <- function(theta, y, X) {
      if (theta[1] <= 0) return(NA)
      -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
    }
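A hedged usage sketch for ols.lf3; the data, start values, and the choice of optim() with box constraints are my own, not from the slides (the lower bound keeps the error variance positive, so the NA branch is never hit):

    # illustrative data: two predictors plus noise
    set.seed(2)
    x1 <- rnorm(100); x2 <- rnorm(100)
    y  <- 1 + 2*x1 - x2 + rnorm(100, sd = 3)

    X <- cbind(1, x1, x2)   # design matrix with an intercept column

    fit <- optim(c(9, 0, 0, 0),                      # start: variance, then betas
                 function(theta) ols.lf3(theta, y = y, X = X),
                 method = "L-BFGS-B",
                 lower = c(1e-6, -Inf, -Inf, -Inf),  # variance must stay positive
                 hessian = TRUE)
    fit$par            # error variance, then intercept and slopes
    sqrt(fit$par[1])   # residual sd estimate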