Maximum Likelihood Estimation
Psych 818 - DeShon
MLE vs. OLS

Ordinary Least Squares Estimation
- Typically yields a closed-form solution that can be directly computed
- Closed-form solutions often require very strong assumptions

Maximum Likelihood Estimation
- Default method for most estimation problems
- Generally equal to OLS when OLS assumptions are met
- Yields desirable "asymptotic" estimation properties
- Foundation for Bayesian inference
- Requires numerical methods :(
MLE logic

- MLE reverses the probability inference
- Recall: p(X|θ), where θ represents the parameters of a model (i.e., a pdf)
  - e.g., what's the probability of observing a score of 73 from a N(70,10) distribution? (see the one-line check below)
- In MLE, you know the data (X_i)
- Primary question: which of a potentially infinite number of distributions is most likely responsible for generating the data? That is, p(θ|X)?
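That forward probability from the "recall" bullet can be checked in one line of R; this check is an addition to the slides, and it assumes the 10 in N(70,10) is the standard deviation, as in the rnorm() calls used later.

    # density of a score of 73 under N(mean = 70, sd = 10)
    dnorm(73, mean = 70, sd = 10)
    # [1] 0.03813878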
Likelihood

- Likelihood may be thought of as an unbounded or unnormalized probability measure
- A pdf is a function of the data given the parameters, on the data scale
- Likelihood is a function of the parameters given the data, on the parameter scale (illustrated below)
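To make the "function of the parameters given the data" reading concrete, here is a small illustrative sketch (the observed score and the candidate means are arbitrary choices, not from the slides): holding one datum fixed and varying the mean traces out its likelihood.

    x.obs <- 73   # a single observed score, held fixed
    # density of the same datum under different candidate means (sd fixed at 10)
    sapply(c(60, 65, 70, 73, 80), function(mu) dnorm(x.obs, mean = mu, sd = 10))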
Likelihood

Likelihood function
- Likelihood is the joint (product) probability of the observed data given the parameters of the pdf
- Assume you have X_1, ..., X_n independent samples from a given pdf, f:

$$L(\theta) = f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Likelihood

Log-likelihood function
- Working with products is a pain
- Maxima are unaffected by monotone transformations, so we can take the logarithm of the likelihood and turn the product into a sum (see the underflow demonstration below):

$$l(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$
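The log is also a numerical necessity, not just a convenience: multiplying many densities underflows double-precision floating point, while summing their logs stays well behaved. A minimal demonstration (this example is an addition to the slides):

    set.seed(1)
    z <- rnorm(1000, mean = 0, sd = 1)

    # raw product of 1000 densities underflows to 0
    prod(dnorm(z, mean = 0, sd = 1))

    # sum of log densities is a perfectly ordinary number
    sum(dnorm(z, mean = 0, sd = 1, log = TRUE))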
Maximum Likelihood

- Find the value(s) of θ that maximize the likelihood function:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\ L(\theta)$$

- Can sometimes be found analytically
  - Maximization (or minimization) is the focus of calculus and derivatives of functions
- Often requires iterative numerical methods
Likelihood

Normal distribution example

- pdf:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

- Likelihood:

$$L(\mu, \sigma) = C\,\sigma^{-n} \exp\left(-\sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

- Log-likelihood:

$$l(\mu, \sigma) = \log(C) - n\log(\sigma) - \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}$$

- Note: C is a constant that vanishes once derivatives are taken
Likelihood

- Can compute the maximum of this log-likelihood function directly (worked below)
- More relevant (and fun) to estimate it numerically!
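For the record, the direct route for μ (with σ treated as fixed) is a one-line derivative; this worked step is an addition to the slides:

$$\frac{\partial l}{\partial \mu} = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$$

That is, the MLE of a normal mean is simply the sample mean.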
Normal distribution example

- Assume you obtain 100 samples from a normal distribution:

    rv.norm <- rnorm(100, mean=5, sd=2)

- This is the true data-generating model!
- Now, assume you don't know the mean of this distribution and we have to estimate it...
- Let's compute the log-likelihood of the observations for N(4,2)
Normal distribution example

    sum(dnorm(rv.norm, mean=4, sd=2, log=T))

- dnorm gives the probability density of an observation for a given distribution
- Summing it across observations gives the log-likelihood: -221.0698
- This is the log-likelihood of the data for the given pdf parameters
- Okay, this is the log-likelihood for one possible distribution... we need to examine it for all possible distributions and select the one that yields the largest value
Normal distribution example

- Make a sequence of possible means:

    m <- seq(from = 1, to = 10, by = 0.1)

- Now, compute the log-likelihood for each of the possible means
- This is a simple "grid search" algorithm:

    log.l <- sapply(m, function(x) sum(dnorm(rv.norm, mean=x, sd=2, log=T)))
Excerpt of the 91-row grid (the full slide listed every candidate mean from 1.0 to 10.0):

        mean     log.l
    1    1.0 -417.3891
    2    1.1 -407.2201
    3    1.2 -397.3012
    ...
    31   4.0 -221.0698
    ...
    41   5.0 -205.6301
    42   5.1 -205.4611
    43   5.2 -205.5421
    ...
    91  10.0 -503.4312

The log-likelihood rises to a peak near a mean of 5.1 and falls off on either side.
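The slides presumably plotted this curve at this point; a one-line sketch that reproduces it from the objects above:

    plot(m, log.l, type = "l", xlab = "candidate mean", ylab = "log-likelihood")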
Normal distribution example

- Why are these numbers negative?
Normal distribution example

- dnorm gives us the probability density of an observation from the given distribution
- The log of a value between 0 and 1 is negative
  - log(.05) = -2.99
- What's the MLE?

    m[which(log.l == max(log.l))]
    = 5.1
Normal distribution example
- What about estimating both the mean and the SD simultaneously?
- Use the grid search approach again... compute the log-likelihood at each combination of mean and SD (a code sketch for the two-parameter grid follows the excerpt)

    Excerpt of the full SD x mean grid:

           SD Mean      log.l
    1     1.0  1.0 -1061.6201
    2     1.0  1.1 -1022.2843
    ...
    859   1.9  4.9  -205.0078
    860   1.9  5.0  -204.9148
    861   1.9  5.1  -205.0988
    ...
    6138  7.7  5.0  -298.9893
    ...
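The slides don't show the code that builds this two-parameter grid; one way to do it, sketched here with my own object names (grid, plus mapply for the per-row sums), is:

    # all combinations of candidate SDs and means
    grid <- expand.grid(sd = seq(1, 10, by = 0.1), mean = seq(1, 10, by = 0.1))

    # log-likelihood at each (SD, mean) combination
    grid$log.l <- mapply(function(s, m) sum(dnorm(rv.norm, mean = m, sd = s, log = TRUE)),
                         grid$sd, grid$mean)

    # the combination with the largest log-likelihood
    grid[which.max(grid$log.l), ]

This should recover the same (SD = 1.9, mean = 5.0) pair the slides report.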
Normal distribution example

- Get max(log.l):

    m[which(log.l == max(log.l), arr.ind=T)]
    = 5.0, 1.9
- Note: this could be done the same way for a simple linear regression (2 parameters)
Algorithms

- Grid search works for these simple problems with few estimated parameters
- Much more advanced search algorithms are needed for more complex problems
- More advanced algorithms take advantage of the slope or gradient of the likelihood surface to make good guesses about the direction of search in parameter space
- We'll use the "mle" routine in R (from the stats4 package)
Algorithms

- Grid search: vary each parameter in turn, compute the log-likelihood, then find the parameter combination yielding the highest log-likelihood
- Gradient search: vary all parameters simultaneously, adjusting relative magnitudes of the variations so that the direction of propagation in parameter space is along the direction of steepest ascent of the log-likelihood
- Expansion methods: find an approximate analytical function that describes the log-likelihood hypersurface and use this function to locate the maximum. The number of computed points is smaller, but the computations are considerably more complicated.
- Marquardt method: a gradient-expansion combination
R – mle routine

- First we need to define a function to maximize
- Wait! Most general routines focus on minimization
  - e.g., root finding for solving equations
- So, we usually minimize the negative log-likelihood:

    norm.func <- function(x, y) {
      sum(sapply(rv.norm, function(z) -1 * dnorm(z, mean = x, sd = y, log = T)))
    }
R – mle routine

    library(stats4)   # mle() lives in the stats4 package
    norm.mle <- mle(norm.func, start = list(x = 4, y = 2),
                    method = "L-BFGS-B", lower = c(0, 0))

- Many interesting points:
  - Starting values
  - Global vs. local maxima or minima
  - Bounds
    - SD can't be negative
R – mle routine

- Output – summary(norm.mle)
- Standard errors come from the inverse of the Hessian matrix
- Convergence!!
- -2(log-likelihood) = deviance
  - Functions like the R² in regression

    Coefficients:
      Estimate Std. Error
    x 4.844249  0.1817031
    y 1.817031  0.1284834

    -2 log L: 403.2285

    > norm.mle@details$convergence
    [1] 0
Maximum Likelihood Regression
- A standard regression:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

- May be broken down into two components:

$$\hat{Y} = \beta_0 + \beta_1 X$$
$$Y \sim N(\hat{Y}, \sigma)$$
Maximum Likelihood Regression

- First define our x's and y's:

    x <- 1:100
    y <- 4 + 3*x + rnorm(100, mean = 5, sd = 20)

- Define the -log-likelihood function:

    reg.func <- function(b0, b1, sigma) {
      if (sigma <= 0) return(NA)   # no sd of 0 or less!
      yhat <- b0*x + b1            # the estimated function (note: b0 is the slope, b1 the intercept here)
      -sum(dnorm(y, mean = yhat, sd = sigma, log = T))   # the -log-likelihood
    }
Maximum Likelihood Regression

- Call mle to minimize the -log-likelihood:

    lm.mle <- mle(reg.func, start = list(b0 = 2, b1 = 2, sigma = 35))

- Get results – summary(lm.mle):

    Coefficients:
            Estimate Std. Error
    b0      3.071449  0.0716271
    b1      8.959386  4.1663956
    sigma  20.675930  1.4621709

    -2 log L: 889.567
Maximum Likelihood Regression

- Compare to OLS results – lm(y ~ x):

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8.95635    4.20838   2.128   0.0358 *
    x            3.07149    0.07235  42.454   <2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 20.88 on 98 degrees of freedom
    Multiple R-squared: 0.9484

- The estimates agree (remember that in reg.func, b0 is the slope and b1 the intercept). The small gap between the MLE sigma (20.68) and the residual standard error (20.88) arises because the MLE divides the residual sum of squares by n rather than n - 2.
Standard Errors of Estimates

- The behavior of the likelihood function near the maximum is important
- If it is flat, then the observations have little to say about the parameters
  - Changes in the parameters will not cause large changes in the probability
- If the likelihood has a pronounced peak near the maximum, then small changes in the parameters cause large changes in probability
  - In this case we say that the observations have more information about the parameters
- Expressed as the second derivative (or curvature) of the log-likelihood function
- If more than 1 parameter, then 2nd partial derivatives
Standard Errors of Estimates

- Rate of change is the second derivative of a function (e.g., velocity and acceleration)
- The Hessian matrix is the matrix of 2nd partial derivatives of the -log-likelihood function
- The entries in the Hessian are called the observed information for an estimate
Standard Errors

- The information is used to obtain the expected variance (or standard error) of the estimated parameters
- When the sample size becomes large, the maximum likelihood estimator becomes approximately normally distributed, with variance close to $I(\theta)^{-1}$
- More precisely (see the sketch below for how this plays out in R):

$$\operatorname{var}(\hat{\theta}) \approx I(\hat{\theta})^{-1}$$
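To tie this back to the earlier fit: stats4's mle() stores the optimizer's Hessian, and its inverse is the variance-covariance matrix that summary() uses. A small sketch, assuming norm.mle is the fit from the earlier slides:

    # Hessian of the -log-likelihood at the optimum
    H <- norm.mle@details$hessian

    # its inverse is the variance-covariance matrix (should match vcov(norm.mle))
    solve(H)

    # square roots of the diagonal reproduce the standard errors in summary(norm.mle)
    sqrt(diag(solve(H)))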
Likelihood Ratio Test

- Let L_F be the maximum of the likelihood function for an unrestricted model
- Let L_R be the maximum of the likelihood function of a restricted model nested in the full model
- L_F must be greater than or equal to L_R
  - Removing a variable or adding a constraint can only hurt model fit. Same logic as R²
- Question: does adding the constraint or removing the variable (a constraint of zero) significantly impact model fit?
  - Model fit will decrease, but does it decrease more than would be expected by chance?
Likelihood Ratio Test

- Likelihood ratio:
  - R = -2 ln(L_R / L_F)
  - R = 2(log(L_F) - log(L_R))
- R is distributed as a chi-square with m degrees of freedom
  - m is the difference in the number of estimated parameters between the two models
- The expected value of R is m, so if you get an R that is bigger than the difference in parameters, then the constraint hurts model fit
- More formally... you should reference the chi-square table with m degrees of freedom to find the probability of getting R by chance alone, assuming that the null hypothesis is true
Likelihood Ratio Example

- Go back to our simple regression example
- Does the variable (X) significantly improve our predictive ability or model fit?
  - Alternatively, does removing X or constraining its parameter estimate to zero significantly decrease prediction or model fit?
- Full model: -2 log L = 889.567
- Reduced model: -2 log L = 1186.05
- Chi-square critical value (1 df, alpha = .05) = 3.84
- R = 1186.05 - 889.567 = 296.48, far beyond 3.84, so X significantly improves model fit (the test is carried out in code below)
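One way to run this test in R, reusing reg.func and lm.mle from the regression slides; the reduced-model wrapper red.func and the object red.mle are my own names, and the sketch follows the slides' convention that b0 is the slope:

    # reduced model: slope (b0 in reg.func's naming) constrained to zero
    red.func <- function(b1, sigma) reg.func(b0 = 0, b1 = b1, sigma = sigma)
    red.mle  <- mle(red.func, start = list(b1 = 2, sigma = 35))

    # likelihood ratio statistic: difference in deviances (-2 log L)
    R <- -2 * as.numeric(logLik(red.mle)) - (-2 * as.numeric(logLik(lm.mle)))

    # compare against the chi-square critical value with 1 df
    R > qchisq(0.95, df = 1)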
Fit Indices

- Akaike's information criterion (AIC)
  - Pronounced "Ah-kah-ee-key"

$$AIC = -2\log L(\hat{\theta}) + 2k$$

- k is the number of estimated parameters in our model
- Penalizes the log-likelihood for using many parameters to increase fit
- Choose the model with the smallest AIC value
Fit Indices

- Bayesian information criterion (BIC)
  - AKA SIC, for Schwarz information criterion

$$BIC = -2\log L(\hat{\theta}) + k\log(n)$$

- Choose the model with the smallest BIC
  - The likelihood is the probability of obtaining the data you did under the given model. It makes sense to choose a model that makes this probability as large as possible. Putting the minus sign in front switches the maximization to minimization. (Both criteria are computed for the earlier regression fit below.)
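Both criteria can be computed from the earlier mle fits. stats4 provides a logLik method for mle objects, so the generic AIC works directly; the BIC line below assembles the formula by hand (k = 3 parameters, n = 100 observations for the regression fit):

    AIC(lm.mle)   # -2 log L + 2k = 889.567 + 2*3 = 895.567

    k <- 3; n <- 100
    -2 * as.numeric(logLik(lm.mle)) + k * log(n)   # BIC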
Multiple Regression

- The -log-likelihood function for multiple regression (a usage sketch follows):

    # Note: theta is a vector of parameters, with the error variance first
    #   (the code takes sqrt(theta[1]) as the sd)
    # theta[-1] is all values of theta except the first (the regression weights)
    # and here we're using matrix multiplication
    ols.lf3 <- function(theta, y, X) {
      if (theta[1] <= 0) return(NA)
      -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
    }
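A hedged usage sketch for ols.lf3; the data, start values, and the choice of optim() with box constraints are my own, not from the slides (the lower bound keeps the error variance positive, so the NA branch is never hit):

    # illustrative data: two predictors plus noise
    set.seed(2)
    x1 <- rnorm(100); x2 <- rnorm(100)
    y  <- 1 + 2*x1 - x2 + rnorm(100, sd = 3)

    X <- cbind(1, x1, x2)   # design matrix with an intercept column

    fit <- optim(c(9, 0, 0, 0),                      # start: variance, then betas
                 function(theta) ols.lf3(theta, y = y, X = X),
                 method = "L-BFGS-B",
                 lower = c(1e-6, -Inf, -Inf, -Inf),  # variance must stay positive
                 hessian = TRUE)
    fit$par            # error variance, then intercept and slopes
    sqrt(fit$par[1])   # residual sd estimate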