Page 1

Elements of Statistical Inference

Theme of the workshop (and book): Analyzing HMs using both classical and Bayesian methods.

“Dual inference paradigm”

Topics covered here:
• Classical inference (likelihood, frequentist)
• Bayesian inference (posterior distribution)
• Implementation in R (both MLE and MCMC)
• Case study: logistic regression (not a HM)
• Case study: occupancy model (a HM)

Page 2

Inference for statistical models

• Parametric inference: explicit probability assumptions about the data. Inference proceeds as if the model were the truth (not an approximation to the truth, but the actual truth)

• Two flavors:
  – Classical inference
  – Bayesian inference

Page 3

Bayesian vs. Classical/Frequentist

• Classical inference
  – Likelihood estimation ('method of maximum likelihood')
  – Frequentists use a relative frequency interpretation in which procedures are evaluated w.r.t. repeated realizations of the data. Probability is used to characterize how well procedures perform, but not uncertainty about model parameters.

• Bayesian inference
  – Posterior inference: requires specification of a prior distribution
  – Bayesians make probability statements directly about model parameters, conditional on the single data set that you have

Page 4

Notation

• Random variables: y, z, N, etc.
  – These always have probability distributions whether you're a Bayesian or not
• Parameters: θ, ψ, p, β, etc. have probability distributions if you're a Bayesian, but not otherwise
• Distributions: f(y | θ), g(θ), etc.
• Bracket notation: [y | θ], [θ], etc.
• Note: "|" (the vertical bar) means "conditional on"

Page 5

Classical inference

• You probably know this, but we review the basic ideas. And we show some technical elements in R to demystify what is being done in unmarked

Page 6

What is the likelihood?
• Observations: random variables that you might observe, y_1, ..., y_n
• The joint distribution of these random variables: f(y_1, ..., y_n | θ) = f(y_1 | θ) × ... × f(y_n | θ)  [Independence!]

• The likelihood is the joint distribution regarded as a function of θ; new notation: L(θ; y_1, ..., y_n) = f(y_1, ..., y_n | θ)

• The value of θ that produces the highest value of the likelihood is the maximum likelihood estimator, the MLE, θ̂

Footnote: the joint distribution by itself is a function of y, and θ is an index that changes its form or shape somehow. But we think of the likelihood the other way around: for the fixed value of y, what is the value of the joint distribution function for different values of θ?

Page 7

Example 1: Two independent binomial counts

# 2 binomial observations
y <- rbinom(2, size=10, p=0.5)

# The joint distribution function. As a function of y it gives
# the probability of any two values of y1 and y2
jointdis <- function(data, K, p){
  prod(dbinom(data, size=K, p=p))
}
(jointdis(y, K=10, p=0.5))
# This also is the likelihood of p = 0.5 for the
# given data, but it is NOT a probability for p.

# Evaluate the likelihood for a grid of values of "p"
p.grid <- seq(0.01, 0.99, , 200)
likelihood <- rep(NA, 200)
for(i in 1:200){
  likelihood[i] <- jointdis(y, K=10, p=p.grid[i])
}
# Plot the likelihood
plot(p.grid, likelihood, xlab="p", ylab="likelihood")

Page 8

• It is not a probability distribution for p even though it is called a 'likelihood', which sounds vaguely like 'probability'. That was a marketing gimmick.
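A quick way to convince yourself of this, reusing y and jointdis() from Example 1 (a minimal sketch): a genuine probability density for p would integrate to 1 over [0, 1], but the likelihood generally does not.

# Integrate the likelihood over p; a probability density for p would give 1
# (reuses jointdis() and y from Example 1)
lik.area <- integrate(function(p) sapply(p, function(pp) jointdis(y, K=10, p=pp)),
                      lower=0, upper=1)$value
lik.area   # generally not equal to 1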

Page 9

Numerical maximization of the likelihood

• Numerical maximization of the likelihood was a HUGE change in applied statistics.

• Importance cannot be over-stated.
• Don't need formulas (explicit estimators)
• Variances: evaluated numerically
• Can do "marginal likelihood" by integrating random effects out
• Don't need a statistician to do things for you

Page 10

Properties of MLEs
• MLEs are asymptotically normally distributed
• The Hessian matrix = matrix of 2nd derivatives of the negative log-likelihood w.r.t. θ (the observed Fisher information matrix). The inverse of the Hessian is the "asymptotic variance-covariance matrix".
  – Asymptotic standard error (ASE)
  – Based on normal approx. to the sampling distribution of the MLE
  – Numerical evaluation: revolutionized statistics in the 1970s
• Asymptotic unbiasedness: as n → ∞, the bias of the MLE → 0.
• Minimum variance: as n → ∞, the variance of the MLE is the minimum among all unbiased estimators.
• Invariance to transformation: the MLE of a function of a parameter is just that function of the MLE of the parameter [note: the variance is not invariant to transformation]
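A minimal illustration of the invariance property, using the two binomial counts y from Example 1 (the odds transformation is just an illustrative choice):

# MLE of p for two independent Binomial(10, p) counts is the pooled proportion
p.hat <- sum(y) / 20
# By invariance, the MLE of the odds p/(1-p) is that same function of p.hat
odds.hat <- p.hat / (1 - p.hat)
c(p.hat = p.hat, odds.hat = odds.hat)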

Page 11

Other elements of classical inference

Page 12

Parametric bootstrapping

• Obtain MLEs for a model
• Simulate data using those MLEs
• Obtain MLEs for the simulated data
• Repeat many times
• Use the distribution of the MLEs from the simulated data as an empirical estimate of the sampling distribution

Page 13

Example 2: logistic regression

• Modeling species occurrence (usually presence or absence):
  – Binomial/Bernoulli observation model: z_i ~ Bernoulli(ψ_i), with logit(ψ_i) = β_0 + β_1 x_i
  – And the z_i are independent
• Likelihood = joint distribution regarded as a function of (β_0, β_1):
  L(β_0, β_1; z_1, ..., z_n) = ∏_i ψ_i^{z_i} (1 − ψ_i)^{1 − z_i}

Page 14

Ordinary logistic regression in R

# -------------------------- Simulate data ----------------------------
# Create a covariate called vegHt
nSites <- 100
set.seed(2014)  # so that we all get the same values of vegHt
vegHt <- runif(nSites, 1, 3)  # uniform from 1 to 3

# Suppose that occupancy probability increases with vegHt
# The relationship is described by an intercept of -3 and
# a slope parameter of 2 on the logit scale
# plogis is the inverse-logit (constrains us back to the [0,1] scale)
psi <- plogis(-3 + 2*vegHt)

# Now we go to 100 sites and observe presence or absence
# Actually, let's just simulate the data
z <- rbinom(nSites, 1, psi)

Page 15

General strategy for likelihood estimation: Express the negative log-likelihood as an R function and then use the standard function optim (or nlm) to minimize it:

# Definition of negative log-likelihood.
negLogLike <- function(beta, y, x) {
  beta0 <- beta[1]
  beta1 <- beta[2]
  psi <- plogis(beta0 + beta1*x)  # inverse-logit
  likelihood <- psi^y * (1-psi)^(1-y)  # same as:
  # likelihood <- dbinom(y, 1, psi)
  return(-sum(log(likelihood)))
}

# Look at (negative) log-likelihood for 2 parameter sets
negLogLike(c(0,0), y=z, x=vegHt)
negLogLike(c(-3,2), y=z, x=vegHt)  # Lower is better!

Page 16

# Let's minimize it formally by function minimisation
starting.values <- c(beta0=0, beta1=0)
opt.out <- optim(starting.values, negLogLike, y=z, x=vegHt, hessian=TRUE)
(mles <- opt.out$par)  # MLEs are pretty close to truth
     beta0      beta1
 -2.539793   1.617025

# Alternative 1: Brute-force grid search for MLEs
mat <- as.matrix(expand.grid(seq(-10,10,0.1), seq(-10,10,0.1)))  # can vary resolution
nll <- array(NA, dim = nrow(mat))
for (i in 1:nrow(mat)){
  nll[i] <- negLogLike(mat[i,], y = z, x = vegHt)
}
which(nll == min(nll))
mat[which(nll == min(nll)),]

# Produce a likelihood surface, shown in Fig. 2-2.
library(raster)
r <- rasterFromXYZ(data.frame(x = mat[,1], y = mat[,2], z = nll))
mapPalette <- colorRampPalette(rev(c("grey", "yellow", "red")))
plot(r, col = mapPalette(100), main = "Negative log-likelihood",
     xlab = "Intercept (beta0)", ylab = "Slope (beta1)")
contour(r, add = TRUE, levels = seq(50, 2000, 100))

Page 17

# Alternative 2: Use canned R function glm as a shortcut
(fm <- glm(z ~ vegHt, family = binomial)$coef)

# Add 3 sets of MLEs into plot
# 1. Add MLE from function minimisation
points(mles[1], mles[2], pch = 1, lwd = 2)
abline(mles[2], 0)  # Put a line through the Slope value
lines(c(mles[1], mles[1]), c(-10, 10))
# 2. Add MLE from grid search
points(mat[which(nll == min(nll)),1], mat[which(nll == min(nll)),2],
       pch = 1, lwd = 2)
# 3. Add MLE from glm function
points(fm[1], fm[2], pch = 1, lwd = 2)
# Note they are essentially all the same

[Figure: negative log-likelihood surface from the grid search; x-axis Intercept (beta0), y-axis Slope (beta1), contour levels 500 to 2000]

Page 18

Asymptotic variance/SE

The hessian=TRUE option in the call to optim produces the Hessian matrix in the returned list opt.out, and so we can obtain the asymptotic standard errors (ASE) for the two parameters by doing this:

Vc <- solve(opt.out$hessian)   # Get variance-covariance matrix
ASE <- sqrt(diag(Vc))          # Extract asymptotic SEs
print(ASE)
    beta0     beta1
0.8687444 0.4436064

 

Page 19

Summary

# Make a table with estimates, SEs, and 95% CI
mle.table <- data.frame(Est=mles,
                        ASE = sqrt(diag(solve(opt.out$hessian))))
mle.table$lower <- mle.table$Est - 1.96*mle.table$ASE
mle.table$upper <- mle.table$Est + 1.96*mle.table$ASE
mle.table
            Est       ASE      lower      upper
beta0 -2.539793 0.8687444 -4.2425320 -0.8370538
beta1  1.617025 0.4436064  0.7475564  2.4864933

# Plot the actual and estimated response curves
plot(vegHt, z, xlab="Vegetation height", ylab="Occurrence probability")
plot(function(x) plogis(-3 + 2*x), 1.1, 3, add=TRUE, lwd=2)   # true curve (beta0 = -3, beta1 = 2)
plot(function(x) plogis(mles[1] + mles[2]*x), 1.1, 3, add=TRUE, lwd=2, col="blue")
legend(1.1, 0.9, c("Actual", "Estimate"), col=c("black", "blue"), lty=1, lwd=2)

Page 20

[Figure: occurrence probability vs. vegetation height, showing the actual and estimated response curves ("Actual", "Estimate")]

Page 21

Work session

• Different ways of obtaining MLEs: grid search, optim(), glm()

• Get the asymptotic SE (ASE)
• Plot a fitted response curve
• Bootstrap

Page 22

Bootstrapping

nboot <- 1000  # Obtain 1000 bootstrap samples
boot.out <- matrix(NA, nrow=nboot, ncol=3)
dimnames(boot.out) <- list(NULL, c("beta0", "beta1", "psi.bar"))

for(i in 1:nboot){
  # Simulate data from the fitted model (parametric bootstrap)
  psi <- plogis(mles[1] + mles[2] * vegHt)
  z <- rbinom(nSites, 1, psi)

  # Fit model to the simulated data
  tmp <- optim(mles, negLogLike, y=z, x=vegHt, hessian=TRUE)$par
  psi.mean <- plogis(tmp[1] + tmp[2] * mean(vegHt))
  boot.out[i,] <- c(tmp, psi.mean)
}

Page 23

Bootstrapping

SE.boot <- sqrt(apply(boot.out, 2, var))  # Get bootstrap SE
names(SE.boot) <- c("beta0", "beta1", "psi.bar")

# 95% bootstrapped confidence intervals
apply(boot.out, 2, quantile, c(0.025, 0.975))
          beta0     beta1   psi.bar
2.5%  -4.490565 0.8379983 0.5728077
97.5% -0.978901 2.5974839 0.7828377

# Bootstrap SEs
SE.boot
     beta0      beta1    psi.bar
0.89156946 0.45943765 0.05428008

# Compare these with the ASEs for regression parameters
mle.table
            Est       ASE      lower      upper
beta0 -2.539793 0.8687444 -4.2425320 -0.8370538
beta1  1.617025 0.4436064  0.7475564  2.4864933

Page 24

Part II: Hierarchical Models
• HMs have 1 or more "intermediate" models/levels/stages involving a latent variable (random effect).

  Observable variable y: the "observation model" [y | z]

  Latent variable z: the "process model" [z]

• Two canonical examples:
  1. Modeling species occurrence – "occupancy models"
  2. Modeling species abundance – "N-mixture models" (and related)

Page 25

Modeling species occurrence: Occupancy models

Observations: y_ij = observation of presence/absence at site i, sample j, for j = 1, ..., J samples
State variable: z_i = binary state variable, true presence (z_i = 1) or absence (z_i = 0)

Observation model:
  y_ij | z_i ~ Bernoulli(p · z_i)
  p = probability of detecting the species given that it is present

Process model:
  z_i ~ Bernoulli(ψ_i)

AKA: a Bernoulli/Bernoulli HM. Also a compound GLM (two GLMs linked together)

Page 26

Modeling species abundance from counts: The N-mixture model

Observations: y_ij = count of birds at point i, sample j
State variable: N_i = integer state variable, the population size at point i

Observation model:
  y_ij | N_i ~ Binomial(N_i, p)
  N_i = population size, p = probability of encountering an individual

Process model:
  N_i ~ Poisson(λ_i)

AKA: a Binomial/Poisson HM. Also a compound GLM.

Page 27

Likelihood inference for hierarchical models

Remove the random effect z from the conditional likelihood by integrating the distribution of y conditional on the random effect over the possible states of the random effect:

  L(θ; y) = ∫ [y | z, θ] [z | θ] dz

• Integrated (or marginal) likelihood
• Not a function of z anymore
• Maximize to obtain MLEs of θ
• For a discrete latent variable, replace ∫ by Σ

Page 28

Example: Occupancy model

• Observation model:
  y_i ~ Binomial(J, p · z_i)

• State model:
  z_i ~ Bernoulli(ψ_i)
  logit(ψ_i) = β_0 + β_1 x_i

• What is the marginal likelihood for y?

Page 29

Example: Occupancy model
Computing the marginal likelihood
• z_i is a discrete random variable having only 2 states (0 and 1), so let's use the law of total probability:

  [y_i] = [y_i | z_i = 1] Pr(z_i = 1) + [y_i | z_i = 0] Pr(z_i = 0)

• Marginal likelihood for the detection frequency y_i:

  [y_i] = Binomial(y_i | J, p) ψ_i + I(y_i = 0) (1 − ψ_i)

• AKA a 'zero-inflated binomial'. Can be maximized easily to obtain MLEs.

• Implemented in PRESENCE or the unmarked function occu

Page 30

Doing it in R

nSites <- 100
vegHt <- runif(nSites, 1, 3)  # uniform from 1 to 3
psi <- plogis(-3 + 2*vegHt)

# Now we simulate true presence/absence for 100 sites
z <- rbinom(nSites, 1, psi)

## Now generate observations
p <- 0.6
J <- 3  # sample each site 3 times
y <- rbinom(nSites, J, p*z)

# This is the negative log-likelihood.
negLogLikeocc <- function(beta, y, x, J) {
  beta0 <- beta[1]
  beta1 <- beta[2]
  p <- plogis(beta[3])
  psi <- plogis(beta0 + beta1*x)
  marg.likelihood <- dbinom(y, J, p)*psi + ifelse(y==0, 1, 0)*(1-psi)
  return(-sum(log(marg.likelihood)))
}

starting.values <- c(beta0=0, beta1=0, logitp=0)
opt.out <- optim(starting.values, negLogLikeocc, y=y, x=vegHt, J=J, hessian=TRUE)
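The MLEs and their ASEs can be pulled out of opt.out exactly as before. As a cross-check, the same zero-inflated binomial model can be fit with the unmarked function occu mentioned earlier; this is a minimal sketch only, and it assumes visit-level detection data (a site-by-visit matrix Y, simulated here for illustration) rather than the summed counts y.

# MLEs and asymptotic SEs from the optim() fit
(mles.occ <- opt.out$par)
sqrt(diag(solve(opt.out$hessian)))

# Cross-check with unmarked::occu (needs a site-by-visit detection matrix)
library(unmarked)
Y <- matrix(rbinom(nSites*J, 1, p*z), nrow=nSites, ncol=J)   # visit-level detections
umf <- unmarkedFrameOccu(y = Y, siteCovs = data.frame(vegHt = vegHt))
(fm.occ <- occu(~1 ~ vegHt, data = umf))   # detection ~ 1, occupancy ~ vegHt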

Page 31

N-mixture model
The strategy is the same: N_i is a discrete variable, so the marginal likelihood is just a sum over the possible values of N_i.

Observation model:
  y_ij | N_i ~ Binomial(N_i, p)

Process model:
  N_i ~ Poisson(λ)

Marginal likelihood:
  [y_i1, ..., y_iJ] = Σ_{N=0}^{∞} { ∏_j Binomial(y_ij | N, p) } Poisson(N | λ)
  (in practice the infinite sum is truncated at some large value K)
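A minimal R sketch of this marginal likelihood, written in the same style as negLogLikeocc (the simulated data, the function name negLogLikeNmix, and the truncation point K = 50 are illustrative assumptions, not part of the slides):

# Simulate simple N-mixture data (illustrative values)
set.seed(1)
nSites <- 100; J <- 3; lambda <- 4; p <- 0.5
N <- rpois(nSites, lambda)
ymat <- matrix(rbinom(nSites*J, N, p), nrow=nSites, ncol=J)

# Negative log of the marginal likelihood: sum N out up to a large K
negLogLikeNmix <- function(parms, ymat, K = 50) {
  lambda <- exp(parms[1])      # log link for abundance
  p <- plogis(parms[2])        # logit link for detection
  nll <- 0
  for (i in 1:nrow(ymat)) {
    lik.i <- 0
    for (N in 0:K) {           # law of total probability over N
      lik.i <- lik.i + prod(dbinom(ymat[i,], N, p)) * dpois(N, lambda)
    }
    nll <- nll - log(lik.i)
  }
  nll
}

opt.nmix <- optim(c(log.lambda=0, logit.p=0), negLogLikeNmix,
                  ymat=ymat, K=50, hessian=TRUE)
exp(opt.nmix$par[1])      # MLE of lambda
plogis(opt.nmix$par[2])   # MLE of p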

Page 32

Continuous case: numerical integration

Page 33

Continuous case: numerical integration

Snowshoe hare data, see Royle and Dorazio (2008, chapter 6)

# FREQUENCIES of individuals captured 0, 1, ..., J = 14 times:
nx <- c(14, 34, 16, 10, 4, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0)
nind <- sum(nx)
J <- 14

Mhlik <- function(parms){
  mu <- parms[1]
  sigma <- exp(parms[2])
  il <- rep(NA, J+1)
  for(k in 0:J){
    il[k+1] <- integrate(
      function(x){ dbinom(k, J, plogis(x)) * dnorm(x, mu, sigma) },
      lower=-Inf, upper=Inf)$value
  }
  -1*(sum(nx*log(il)))
}
tmp <- nlm(Mhlik, c(-1, -1), hessian=TRUE)
sqrt(diag(solve(tmp$hessian)))

Page 34

Part III: Bayesian inference

Page 35

Bayes’ rule

Bayes' Rule: a probability law that relates the conditional distributions [z | y] and [y | z] for any random variables y and z:

  [z | y] = [y | z] [z] / [y]

• [y | z] = distribution of y conditional on z (think detection given occupancy)
• [z | y] = distribution of z conditional on y
• [z] = marginal distribution of z
• [y] = marginal distribution of y

Page 36

Bayes’ rule, continued

Example: if z = occupancy state and y = number of detections in J visits, then [z = 1 | y = 0] is the probability of occupancy given non-detection in the J visits:

  Pr(z = 1 | y = 0) = Pr(y = 0 | z = 1) Pr(z = 1) / { Pr(y = 0 | z = 1) Pr(z = 1) + Pr(y = 0 | z = 0) Pr(z = 0) }
                    = (1 − p)^J ψ / { (1 − p)^J ψ + (1 − ψ) }

Plug in known quantities (values of ψ, p, and J):
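As a quick numerical sketch (the values ψ = 0.8, p = 0.5 and J = 3 are illustrative, not from the slide):

psi <- 0.8; p <- 0.5; J <- 3              # illustrative values
(1-p)^J * psi / ((1-p)^J * psi + (1-psi))
# about 0.33: even after 3 non-detections there is a 1-in-3 chance the site is occupied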

Page 37

Bayes’ rule, continued

Bayes' rule is not a controversial thing, it is just a basic law of probability. However, advocates of Bayesian inference assert its general use for inference about model parameters which are not traditionally considered to be random variables

e.g., for data y and some parameter θ:

  [θ | y] = [y | θ] [θ] / [y]

• [θ] = the "prior distribution"

Page 38

Bayesian inference
Joint distribution of the observations conditional on parameters:
  [y_1, ..., y_n | θ]

Distribution for parameters -- the prior distribution:
  [θ]

We can invoke Bayes' rule, i.e., compute the conditional distribution:
  [θ | y_1, ..., y_n] = [y_1, ..., y_n | θ] [θ] / [y_1, ..., y_n]

• numerator = joint distribution of data and parameters
• denominator = marginal distribution of the data

Page 39

The Posterior distribution

• Arises from application of basic rules of probability, because everything is a random variable
• Is a probability distribution for the parameters!
• Characterizes uncertainty in the parameter values using explicit probability statements
• "Bayesian confidence interval" (usually called a "credible interval")
• In general, report summaries of the posterior distribution: mean, mode, variance, etc.

Page 40

Computing the posterior distribution

1. Do the math. Recognize the mathematical form of the posterior as a standard named distribution that we can compute moments of.

2. Monte Carlo approximation -- draw samples from the posterior distribution and quantify posterior features by summarizing the samples. Markov chain Monte Carlo (MCMC).

Page 41

Computing the posterior distribution
In limited cases we can identify the posterior distribution analytically. E.g., if y = the number of times we detected a species at some site in J visits, we assume y ~ Binomial(J, p) and p ~ Uniform(0, 1), and then

  p | y ~ Beta(y + 1, J − y + 1)

The MLE is y/J, the posterior mean is (y + 1)/(J + 2), and the posterior mode is y/J (also this is the MLE).

If we use a uniform prior for p (a Beta(1, 1) distribution), then the posterior mode is equal to the MLE. This is a general result: if you use a uniform prior then there is a correspondence between posterior modes and MLEs.
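A small numerical sketch of this conjugate result (y = 2 detections in J = 5 visits is an illustrative choice):

y.obs <- 2; J <- 5                      # illustrative data
p.grid <- seq(0, 1, length.out = 200)
plot(p.grid, dbeta(p.grid, y.obs + 1, J - y.obs + 1), type = "l",
     xlab = "p", ylab = "Posterior density [p|y]")   # Beta(y+1, J-y+1) posterior
abline(v = y.obs / J, lty = 2)          # MLE = posterior mode = y/J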

Page 42

Computing the posterior distribution: MCMC

The posterior:

  [θ | y] = [y | θ] [θ] / [y]

Computing the denominator is computationally expensive, and sometimes not even possible; e.g., for two parameters θ_1 and θ_2:

  [y] = ∫∫ f(y_1, y_2, ..., y_n | θ_1, θ_2) g(θ_1, θ_2) dθ_1 dθ_2

(one integral for each parameter that has a prior distribution).
• Usually not recognizable as a known distribution.
• Can only be done analytically in very special cases.

Page 43

How to do Bayesian analysis: MCMC

• MCMC: simulation methods for sampling from the posterior distribution that do not require that we know the denominator, or ever have to evaluate it.

• We estimate (calculate) features of the posterior distribution from the posterior samples using Monte Carlo averages.

• E.g., if we obtain samples θ^(1), ..., θ^(K) from the posterior distribution, then
  E(θ | y) ≈ (1/K) Σ_k θ^(k)

• The topic of MCMC is too vast to cover here; we cover only a couple of basic ideas such as Metropolis-Hastings (otherwise we use BUGS/JAGS!).

Page 44

The Metropolis Algorithm
Target distribution for a random variable θ: π(θ). Note: in practice this is always the posterior distribution:

  π(θ) = [θ | y] ∝ [y | θ] [θ]

But it can be any distribution at all.

Step 0. Initialize θ^(t=0); θ^(t−1) is the current value of the parameter, i.e., at step t of the algorithm.

Step 1. Draw a candidate value of the parameter, θ*, from some symmetric proposal distribution h(θ* | θ^(t−1)). Symmetric means h(θ* | θ^(t−1)) = h(θ^(t−1) | θ*)
  – E.g., θ* ~ Normal(θ^(t−1), δ)
  – E.g., θ* ~ Uniform(0, 1)   # need not depend on θ^(t−1)

Page 45

The Metropolis Algorithm

Step 2. Accept the candidate value with probability related to the ratio of the target distribution evaluated at the candidate to that evaluated at the current value:

  r = [y | θ*] [θ*] / ( [y | θ^(t−1)] [θ^(t−1)] )
  # Note: the denominator [y] cancels
  # Note: r also involves the candidate generator h if it is not symmetric

  Acceptance probability = min(1, r)

Step 3. Repeat a few thousand times

Page 46

The Metropolis Algorithm

Practical relevance:
• The marginal distribution of y (i.e., the denominator of the posterior or of the conditional posterior) cancels, so we don't need to know what it is.

• To use the Metropolis algorithm we only have to evaluate the known distributions that make up the posterior (likelihood and prior), but not the posterior itself

Page 47

Illustration

• Suppose you collect two observations, y_1 and y_2, which are independent Binomial random variables with K = 10 trials and p unknown. Obtain the posterior distribution of p using the Metropolis algorithm.

Page 48

Illustration of MCMC using Metropolis Algorithm

# 2 binomial observations
y <- rbinom(2, size=10, p=0.5)

# The joint distribution function. As a function of data it gives
# the probability of any two values of data = c(y1, y2)
jointdis <- function(data, K, p){
  prod(dbinom(data, size=K, p=p))
}

(jointdis(y, K=10, p=0.5))
# also happens to be the likelihood of the value p = 0.5 for the
# given data, but it is NOT a probability for p.

# Evaluate the likelihood for a grid of values of "p"
p.grid <- seq(0.1, 0.9, , 200)
likelihood <- rep(NA, 200)
for(i in 1:200){
  likelihood[i] <- jointdis(y, K=10, p=p.grid[i])
}
# Plot the likelihood
plot(p.grid, likelihood, xlab="p", ylab="likelihood")

Page 49

Illustration of MCMC using Metropolis Algorithm

Page 50

Illustration of MCMC using Metropolis Algorithm

Need a prior distribution for p: p ~ Beta(a, b), where we assume a and b are fixed subjectively, i.e., they are not parameters to estimate.

The target distribution is the posterior distribution:

  [p | y_1, y_2] = [y_1, y_2 | p] [p] / [y_1, y_2]

• [y_1, y_2] = ????
• Possibly we could figure this out, but why bother?

Page 51

Illustration of MCMC using Metropolis Algorithm

The target distribution is the posterior distribution:

  [p | y_1, y_2] ∝ [y_1, y_2 | p] [p]

This is proportional to the joint distribution of the data (which was the likelihood) times the beta prior distribution, so we'll make an R function out of it:

# Define the joint distribution of the data
jointdis <- function(data, K, p){
  prod(dbinom(data, size=K, p=p))
}
# Posterior is proportional to likelihood times prior
posterior <- function(data, K, p, a, b){
  prod(dbinom(data, size=K, p=p)) * dbeta(p, a, b)
}

Page 52

Illustration of MCMC using Metropolis Algorithm

## Do 100,000 MCMC iterations using the Metropolis algorithm
## Assume a uniform prior, which is beta(1,1)
mcmc.iters <- 100000
out <- rep(NA, mcmc.iters)
# starting value
p <- 0.2

for(i in 1:mcmc.iters){
  # use a uniform candidate generator. This is not efficient
  p.cand <- runif(1, 0, 1)
  r <- posterior(y, K=10, p=p.cand, a=1, b=1) / posterior(y, K=10, p=p, a=1, b=1)
  # generate a uniform r.v. and compare with "r"; this imposes the
  # correct probability of acceptance
  if(runif(1) < r)   # This is how you "do something" with probability r
    p <- p.cand
  out[i] <- p
}
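Once the sampler has run, posterior summaries are just Monte Carlo averages of the saved draws. A minimal sketch (the burn-in of 1,000 iterations is an illustrative choice):

out.post <- out[-(1:1000)]              # discard a short burn-in (illustrative)
mean(out.post)                          # posterior mean of p
quantile(out.post, c(0.025, 0.975))     # 95% credible interval for p
hist(out.post, xlab = "p", main = "Posterior of p")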

Page 53

Likelihood vs. posterior

Page 54

The posterior of a function of a model parameter

• To estimate a function of a parameter (and its variance) you only have to apply that function to the posterior samples of the parameter.

• If θ^(k) is a posterior sample of some parameter θ, then g(θ^(k)) is a posterior sample of the parameter g(θ).

• As an exercise, estimate the posterior distribution of a function of p from the binomial example
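For instance, reusing the saved Metropolis samples in out (the odds p/(1−p) is just an illustrative choice of function):

# Posterior samples of a derived quantity: apply the function to each draw
odds <- out / (1 - out)                 # g(p) = p/(1-p), illustrative
mean(odds)                              # posterior mean of the odds
quantile(odds, c(0.025, 0.975))         # 95% credible interval for the odds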

Page 55

Remarks on the Metropolis Algorithm

Heuristic: This algorithm has us simulate candidate values somehow, even arbitrarily, and then accept values that have higher posterior probability

• The long-run frequency of "accepted" values is that of the target posterior density!

• Note: If the prior is constant, this MCMC calculation is based on repeated evaluations of the likelihood only. So, if you can write a function to do MLE, you can also do MCMC.

Page 56

Summary thoughts on Bayesian/classical inference

Both inference paradigms are useful for analysis of hierarchical models

Bayesian:
• Completely general methods for implementation (MCMC) which always work. Sometimes BUGS implementations don't work, so it's good to know how to do it yourself.
• Bayes is great for complex models with lots of latent structure
• Inferences are not asymptotic; they apply to an arbitrary sample size n. In particular, the n that you have.
• Prediction/transformation is more coherent -- comes "for free"
• Takes more math/programming know-how????
• Sometimes slower due to more calculations

Classical:
• Integrated likelihood sometimes not feasible (e.g., the community model)
• But very accurate (not simulation based, no MC error)
• Automatic model selection (AIC)

Page 57

Idealized Structure of Workshop/Book

• Introduction to a class of models
• Likelihood analysis of models in unmarked
  – Stressing consistent work flow and ease of doing standard things like prediction and model selection
• Bayesian analysis in BUGS
• Illustration of a type of model that can't be done (easily, or in unmarked) using likelihood methods