Sta 216: Lecture 4
Last Class: Frequentist hypothesis testing through analysis of
deviance, standard errors & confidence intervals
Today’s Class:
• Variable selection using stepwise selection
• AIC & BIC criteria for model selection
• Uncertainty in link function
• Introduction to Bayes inference in GLMs
Model Uncertainty
In most data analyses, there is uncertainty about the model & you
need to do some form of model comparison.
1. Select q out of p predictors to form a parsimonious model
2. Select the link function (e.g., logit or probit)
3. Select the distributional form (e.g., normal or t)
We focus first on problem 1 - Variable Selection
Frequentist Strategies for Variable Selection
It is standard practice to sequentially add or drop variables from a
model one at a time and examine the change in model fit
If model fit does not improve much (or significantly at some arbitrary
level) when adding a predictor, then that predictor is left out of the
model (forward selection)
Similarly, if model fit does not decrease much when removing a predictor, then that predictor is removed (backwards elimination)
Implementation: step() function in S-PLUS.
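A minimal sketch of both directions in the same R/S-PLUS style (the simulated data, seed, and variable names below are illustrative, not the lecture example; step() uses the AIC as its default criterion):

set.seed(1)
dat<- data.frame(X.1=rnorm(100), X.2=rnorm(100), X.3=rnorm(100))
dat$y<- rbinom(100, 1, 1/(1+exp(-dat$X.1)))   # only X.1 truly matters
fit0<- glm(y ~ 1, family=binomial, data=dat)  # intercept-only starting model
fit.fwd<- step(fit0, scope= ~ X.1 + X.2 + X.3,
   direction="forward", trace=F)              # add terms while the AIC improves
fit.full<- glm(y ~ ., family=binomial, data=dat)
fit.bwd<- step(fit.full, direction="backward", trace=F)  # drop terms while the AIC improves
fit.fwd$anova                                 # path of added terms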
Illustration
• Suppose we have 30 candidate predictors, X1, . . . , X30.
• The true model is
logit Pr(yi = 1 | xi) = xi′β,
where xi = (xi1, . . . , xi,30)′ for i = 1, . . . , n
• Note that predictors with β = 0 are effectively not in the model
• Let γj = 1(βj ≠ 0) be a 0/1 indicator that the jth predictor is included
• Interest focuses on selection of the subset of q ≤ p important predictors out of the p candidate predictors:
Xγ = {Xj : γj = 1, βj ≠ 0} ∼ (q × 1).
• There is a list of 2^p possible models corresponding to the different subsets Xγ ⊂ X.
• The size of the model list, 2^p, is calculated by noting that we can either include or exclude each of the p candidate predictors.
• Letting M denote the model list, we can introduce a model index M ∈ M
• The trick is how to decide between competing models M ∈ M and M′ ∈ M.
• An important issue is also efficient searching of the (potentially huge!) model space M for good models
• Often not possible to visit and perform calculations for all models in M (2^30 = 1,073,741,824 for p = 30)
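• As a quick illustrative sketch (R-style code, not part of the lecture example), the size of the model list can be computed directly, and the inclusion indicators γ can only be enumerated when p is small:

p<- 30
2^p                                           # 1073741824 candidate models
p.small<- 4                                   # enumeration is feasible only for small p
gamma<- expand.grid(rep(list(0:1), p.small))  # each row is one inclusion vector
nrow(gamma)                                   # 2^4 = 16 models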
• Returning to the logistic regression with 30 candidate predictors
example, we simulate data & run stepwise selection
• Simulation conditions: n = 1000, q = 5, p = 30,
xi ∼ Np(0, Ip),
X1 − X5 have coefficients βγ = (1, 2, 3, 4, 5)/2,
X6 − X30 have coefficients 0.
• S-PLUS implementation:
n<- 1000 # sample size
p<- 30 # number of candidate predictors
X<- matrix(0,n,p) # simulate predictor values
for(j in 1:p) X[,j]<- rnorm(n,0,1)
Xl<- data.frame(X)
beta<- c((1:5)/2,rep(0,25)) # true values of parameters
py<- 1/(1+exp(0.5-X%*%beta)) # probability of response
# note that intercept is assumed to be -0.5
y<- rbinom(n,1,py) # simulate response from true model
fit.true<- glm(y ~ X.1 + X.2 + X.3 + X.4 + X.5,
family=binomial, data=Xl) # fit true model
ind<- sample(1:p) # random permutation of the predictor columns
X2<- Xl[,ind] # scramble order of predictors
fit.full<- glm(y ~ ., family=binomial, data=X2)
# implement stepwise selection using AIC criteria
fit.step<- step(fit.full, scope=list(upper = ~., lower=~1),
trace=F) # intercept in all
fit.step$anova # summarize results
• Comments: we start with the full model & perform backwards elimination, sequentially removing predictors whose removal reduces the AIC
• Definition of Akaike Information Criterion (AIC):
AIC = −2 (maximized log likelihood) + 2 (number of parameters)
• Why put in the number of parameters?
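• A small hedged illustration (R-style sketch with hypothetical data, not the lecture simulation): the maximized log likelihood never decreases when a pure-noise predictor is added, but the AIC typically gets worse, which is exactly what the penalty is for:

set.seed(2)
x1<- rnorm(200); x2<- rnorm(200)        # x2 is pure noise
y<- rbinom(200, 1, 1/(1+exp(-x1)))
fit1<- glm(y ~ x1, family=binomial)
fit2<- glm(y ~ x1 + x2, family=binomial)
logLik(fit1); logLik(fit2)   # log likelihood is never smaller for the bigger model
AIC(fit1); AIC(fit2)         # AIC is usually larger (worse) for the bigger model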
• Results from stepwise analysis of simulated data:
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
y ~ X.4 + X.15 + X.26 + X.13 + X.16 + X.25 + X.9 + X.3 +
X.18 + X.29 + X.1 + X.30 + X.24 + X.11 + X.12 + X.7 +
X.21 + X.10 + X.17 + X.8 + X.5 + X.28 + X.22 + X.27 +
X.2 + X.23 + X.6 + X.14 + X.19 + X.20
Final Model:
y ~ X.4 + X.15 + X.26 + X.25 + X.3 + X.1 + X.12 + X.17 +
X.5 + X.28 + X.22 + X.2
Step Df Deviance Resid. Df Resid. Dev AIC
1 969 528.2075 590.2075
2 - X.9 1 0.018536 970 528.2260 588.2260
3 - X.20 1 0.020528 971 528.2466 586.2466
4 - X.11 1 0.024500 972 528.2711 584.2711
5 - X.18 1 0.045945 973 528.3170 582.3170
6 - X.13 1 0.057255 974 528.3743 580.3743
7 - X.16 1 0.080202 975 528.4545 578.4545
8 - X.24 1 0.100360 976 528.5548 576.5548
9 - X.21 1 0.278245 977 528.8331 574.8331
10 - X.30 1 0.296931 978 529.1300 573.1300
11 - X.27 1 0.440982 979 529.5710 571.5710
12 - X.10 1 0.445923 980 530.0169 570.0169
13 - X.7 1 0.447855 981 530.4648 568.4648
14 - X.23 1 0.625145 982 531.0899 567.0899
15 - X.8 1 1.000717 983 532.0906 566.0906
16 - X.6 1 1.025047 984 533.1157 565.1157
17 - X.29 1 1.366004 985 534.4817 564.4817
18 - X.19 1 1.445814 986 535.9275 563.9275
19 - X.14 1 1.623275 987 537.5508 563.5508
• Selected model contains predictors X1 − X5 belonging to the true model (we got lucky & the coefficients are large)
• Also contains predictors X12, X15, X17, X22, X25, X26, X28, which should have been excluded (not so good)
• Results of maximum likelihood estimation for true model:
Value Std. Error t value
(Intercept) -0.4494802 0.1106592 -4.061844
X.1 0.5525122 0.1115489 4.953095
X.2 0.9541896 0.1176940 8.107376
X.3 1.7647030 0.1489397 11.848442
X.4 2.0024375 0.1638457 12.221483
X.5 2.8375060 0.2018184 14.059703
• Results of maximum likelihood estimation for the full model (focusing on the important predictors):
Value Std. Error t value
(Intercept) -0.47183494 0.1181678 -3.9929235
X.1 0.58449537 0.1214352 4.8132281
X.2 1.06182854 0.1289719 8.2330242
X.3 1.91001906 0.1626665 11.7419354
X.4 2.13608265 0.1772401 12.0519146
X.5 3.04082332 0.2230194 13.6347913
• Full model MLEs differ somewhat from the MLEs under the true model, and the standard errors are higher
• Now, consider the estimates from the final selected model:
Value Std. Error t value
(Intercept) -0.4661101 0.1148807 -4.057341
X.1 0.5675410 0.1157651 4.902522
X.2 1.0170453 0.1233606 8.244489
X.3 1.8653303 0.1574395 11.847921
X.4 2.1040769 0.1731561 12.151332
X.5 2.9939133 0.2168262 13.807893
X.12 0.1725280 0.1124232 1.534630
X.15 0.1593174 0.1071552 1.486790
X.17 -0.2245846 0.1059882 -2.118958
X.22 -0.2398270 0.1128883 -2.124464
X.25 -0.1953539 0.1166990 -1.673999
X.26 -0.2685661 0.1112931 -2.413143
X.28 -0.1726031 0.1110436 -1.554373
• Comparing the estimates for the predictors that should have been included to the earlier results, we find values in between those for the true model and the full model.
• Interesting result: many of the predictors known to be unimportant have coefficients “significantly” different from zero
• This is a very common occurrence when using stepwise selection!
• If we had chosen smaller values for the coefficients of the important predictors, we would have missed one or more of them (based on repeating the simulation)
Some Issues with Stepwise Procedures
Normal Linear & Orthogonal Case
• In normal models with orthogonal X, forward and backwards
selection will yield the same model (i.e., the selection process is
not order-dependent).
• However, the selection of the significance level for inclusion in
the model is arbitrary and can have a large impact on the final
model selected. Potentially, one can use some goodness of fit
criterion (e.g., AIC, BIC).
• In addition, if interest focuses on inference on the β’s, stepwise
procedures can result in biased estimates and invalid hypothesis
tests (i.e., if one naively uses the final model selected without
correction for the selection process)
Issues with Stepwise Procedures (General Case)
For GLMs other than orthogonal, linear Gaussian models, the order
in which parameters are added or dropped from the model can have
a large impact on the final model selected.
This is not good, since choices of the order of selection are typically
(if not always) arbitrary.
Model selection is a challenging area, but there are alternatives
Goodness of Fit Criteria
There are a number of criteria that have been proposed for comparing
models based on a measure of goodness of fit penalized by model
complexity
1. Akaike’s Information Criterion (AIC):
AIC_M = D_M + 2 p_M,
where D_M is the deviance for model M and p_M is the number of predictors
2. Bayesian Information Criterion (BIC):
BIC_M = D_M + p_M log(n).
Note that the deviance decreases (and the likelihood increases) as variables are added.
The AIC and BIC differ in the penalty for model complexity, with
the AIC using twice the number of parameters and the BIC using
the number of parameters multiplied by the logarithm of the sample
size.
The BIC tends to place a larger penalty on the number of predictors (since log(n) > 2 whenever n ≥ 8), and hence more parsimonious models are selected.
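As a hedged R-style sketch (fit is a hypothetical, previously fitted GLM), both criteria can be computed directly from a fit; note that AIC() also accepts a penalty multiplier k, so k = log(n) gives the BIC:

k<- length(coef(fit))                  # number of estimated parameters
n<- nobs(fit)                          # sample size
-2*as.numeric(logLik(fit)) + 2*k       # AIC, matches AIC(fit)
-2*as.numeric(logLik(fit)) + k*log(n)  # BIC, matches AIC(fit, k=log(n))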
Uncertainty in the link function & distribution
• Note that we can similarly use the AIC or BIC criteria to select
the link function or distribution
• For example, suppose that we have binary outcome data
• There are many possible link functions - any smooth, monotone function mapping ℝ → [0, 1] (i.e., cumulative distribution functions for continuous densities)
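• For instance (a small illustrative R-style check), the logistic, probit, and complementary log-log inverse links all map the linear predictor into [0, 1]:

eta<- seq(-4, 4, length=9)                # linear predictor values
cbind(eta, logit=plogis(eta), probit=pnorm(eta),
   cloglog=1 - exp(-exp(eta)))            # all three columns lie in [0, 1]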
• We simulated data from a logistic regression model with xi =
(1, dosei), where dosei ∼ Uniform(0, 1), i = 1, . . . , 100.
• The parameters were chosen to be β = (−3, 5)
• As an alternative to the logistic model, we considered the probit:
Pr(yi = 1 | xi) = Φ(xi′β),
where Φ(z) = ∫_{−∞}^z (2π)^{−1/2} exp(−u^2/2) du is the standard normal cdf
• We also considered the complementary log-log model:
Pr(yi = 1 | xi) = 1 − exp{−exp(xi′β)}.
[Figure: Pr(response) plotted against dose (0 to 1)]
Summary of results
• Logistic regression results:
β̂ = (−3.67, 6.14), AIC = 94.0211, BIC = 99.23145
• Probit regression results:
β̂ = (−2.14, 3.60), AIC = 93.9993, BIC = 99.20964
• Complementary log-log results:
β̂ = (−3.14, 4.39), AIC = 94.21893, BIC = 99.42927
• Low values of AIC & BIC are preferred, so we select the probit
model (which happens to be wrong)
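• A sketch of how such a comparison can be run (R-style code with illustrative simulated data; it will not reproduce the exact values above):

set.seed(3)
n<- 100
dose<- runif(n)
y<- rbinom(n, 1, 1/(1+exp(-(-3 + 5*dose))))     # true model is logistic
fit.logit<- glm(y ~ dose, family=binomial(link="logit"))
fit.probit<- glm(y ~ dose, family=binomial(link="probit"))
fit.cll<- glm(y ~ dose, family=binomial(link="cloglog"))
sapply(list(logit=fit.logit, probit=fit.probit, cloglog=fit.cll),
   function(f) c(AIC=AIC(f), BIC=AIC(f, k=log(n))))  # compare criteria across links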
Some things to think about
• It is very often the case that many models are consistent with
the data
• This is particularly true when there is a large number of models
in the list of plausible models
• Ideally, substantive information can be brought to bear to reduce
the size of the list
• For example, certain models may be more consistent with biology
• However, one is typically still faced with many possible models
& concerned about sensitivity of inferences to the model chosen
or selected by some algorithm
• Given the selection bias that occurs, it may be better to focus
a priori on a single model rather than run stepwise selection
• However, better yet would be to formally account for model uncertainty (e.g., through Bayesian model averaging)
Bayesian Inference in GLMs
Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests
Bayesians base inferences on the posterior distribution of the
unknowns of interest.
The posterior quantifies the current state of knowledge about the unknowns, incorporating subject matter knowledge, past data, and the current data.
Expert knowledge and past data are incorporated through a prior
distribution, while the current data are incorporated through the
likelihood function.
Bayesian inference in linear regression models
• Likelihood:
L(y; x, β, τ) = ∏_{i=1}^n (2πτ^{−1})^{−1/2} exp{−(τ/2)(yi − xi′β)^2},
where τ = σ^{−2} is the error precision
• Commonly-used prior:
π(β, τ) = π(β)π(τ) = Np(β; β0, Σ0) G(τ; aτ, bτ).
Here, the p × 1 vector of regression coefficients is assigned a multivariate normal prior:
π(β) = |2πΣ0|^{−1/2} exp{−(1/2)(β − β0)′Σ0^{−1}(β − β0)},
where β0 is the prior mean and Σ0 is the prior covariance
The hyperparameters β0,Σ0 quantify our state of knowledge
about the regression parameters β prior to observing the data
from the current study
In particular, β0 is our best guess for β before looking at the
current data & Σ0 expresses uncertainty in this guess
The prior for the error precision follows the gamma density
π(τ) = (bτ^{aτ} / Γ(aτ)) τ^{aτ−1} exp(−bττ),
which has expectation E(τ) = aτ/bτ and variance V(τ) = aτ/bτ^2
Hyperparameters aτ, bτ are chosen to express knowledge about τ
A non-informative prior is achieved by letting π(β, τ) ∝ τ^{−1} (in the limit as the prior variance → ∞)
Bayesian updating & posterior distributions
• After specifying the prior, we update the prior to incorporate
information in the likelihood using Bayes rule.
• This updating process yields the posterior distribution:
π(β, τ | y, x) = π(β, τ)L(y; x, β, τ) / ∫ π(β, τ)L(y; x, β, τ) dβ dτ = π(β, τ)L(y; x, β, τ) / π(y; x),
where π(y; x) is the marginal likelihood of the data (obtained by integrating the likelihood across the prior for the parameters)
• The expression for the posterior holds for any GLM, letting τ
denote the scale parameter and L(y; x,β, τ ) the exponential
family likelihood
• The primary difference between Bayesian inference in normal
linear regression models and in other GLMs is computational
• In particular, for normal linear regression models with conjugate
priors, the posterior distribution has a simple form.
• The posterior for the regression coefficients can be derived as follows:
π(β | y, x, τ) ∝ π(β)L(y; x, β, τ)
∝ exp{−(1/2)(β − β0)′Σ0^{−1}(β − β0)} exp{−(τ/2) ∑_{i=1}^n (yi − xi′β)^2}
∝ exp[−(1/2){β′(Σ0^{−1} + τ ∑_{i=1}^n xixi′)β − 2β′(Σ0^{−1}β0 + τ ∑_{i=1}^n xiyi)}]
∝ Np(β; β̂, Σ̂β),
• Thus, the posterior distribution of β given τ is multivariate
normal
• The posterior mean is
β̂ = E(β | τ, y, X) = Σ̂β(Σ0^{−1}β0 + τX′y)
• The posterior variance is
Σ̂β = V(β | τ, y, X) = (Σ0^{−1} + τX′X)^{−1}.
• Note that in the limiting case as the prior variance increases, β̂ → (X′X)^{−1}X′y, which is simply the least squares estimator or MLE
• For an informative prior, the posterior mean is shrunk back towards the prior mean β0 to a degree dependent on the prior variance
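• A small R-style sketch of these formulas, treating τ as known (the data, prior, and seed are illustrative):

set.seed(4)
n<- 25
X<- cbind(1, runif(n))                          # design: intercept + one predictor
y<- as.vector(X %*% c(-1, 2) + rnorm(n))        # true beta = (-1, 2), sigma = 1
tau<- 1                                         # error precision, assumed known
beta0<- c(0, 0); Sigma0<- diag(2)               # N(0, I) prior for beta
Sigma.hat<- solve(solve(Sigma0) + tau*crossprod(X))                       # posterior covariance
beta.hat<- Sigma.hat %*% (solve(Sigma0) %*% beta0 + tau*crossprod(X, y))  # posterior mean
cbind(beta.hat, solve(crossprod(X)) %*% crossprod(X, y))                  # vs. least squares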
• We can similarly derive the posterior distribution of τ:
π(τ | y, X, β) ∝ π(τ)L(y; X, β, τ)
∝ τ^{aτ−1} exp(−bττ) · τ^{n/2} exp{−(τ/2) ∑_{i=1}^n (yi − xi′β)^2}
∝ τ^{aτ+n/2−1} exp[−τ{bτ + (1/2) ∑_{i=1}^n (yi − xi′β)^2}]
∝ G(τ; aτ + n/2, bτ + (1/2) ∑_{i=1}^n (yi − xi′β)^2).
• Thus, the error precision has a gamma posterior distribution
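• Together, the two full conditional distributions suggest a simple Gibbs sampler; a hedged sketch with illustrative hyperparameters and simulated data (uses mvrnorm from the MASS library):

library(MASS)                                    # for mvrnorm
set.seed(5)
n<- 25; X<- cbind(1, runif(n))
y<- as.vector(X %*% c(-1, 2) + rnorm(n))         # simulated data, sigma = 1
beta0<- c(0, 0); Sigma0<- diag(2)                # prior for beta
a.tau<- 1; b.tau<- 1                             # prior for tau (illustrative)
S<- 2000; beta<- c(0, 0); draws<- matrix(NA, S, 3)
for(s in 1:S){
   tau<- rgamma(1, a.tau + n/2, b.tau + 0.5*sum((y - X %*% beta)^2))  # tau | beta, y
   V<- solve(solve(Sigma0) + tau*crossprod(X))                        # beta | tau, y
   m<- as.vector(V %*% (solve(Sigma0) %*% beta0 + tau*crossprod(X, y)))
   beta<- as.vector(mvrnorm(1, m, V))
   draws[s,]<- c(beta, tau)
}
colMeans(draws[-(1:500),])                       # posterior means after burn-in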
Example: Bayesian updating
• Suppose for example that we have a simple linear regression
model
yi = β0 + β1dosei + εi, εi ∼ N(0, 1)
• We simulate data under the true model: β = (−1, 2), n = 25,
dosei ∼ Uniform(0, 1)
• We consider priors π(β) ∝ 1 & π(β) = N(0, I2)
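• A hedged R-style sketch of this example (the error precision is known here, τ = 1, so both posteriors are available in closed form; the exact curves depend on the simulated data):

set.seed(6)
n<- 25; dose<- runif(n); X<- cbind(1, dose)
y<- as.vector(X %*% c(-1, 2) + rnorm(n))                      # true beta = (-1, 2)
V.u<- solve(crossprod(X)); m.u<- V.u %*% crossprod(X, y)      # posterior, uniform prior
V.n<- solve(diag(2) + crossprod(X)); m.n<- V.n %*% crossprod(X, y)  # posterior, N(0,I) prior
grid<- seq(-2, 4, length=200)                                 # values of the slope parameter
plot(grid, dnorm(grid, m.u[2], sqrt(V.u[2,2])), type="l",
   xlab="slope parameter", ylab="density")                    # posterior, uniform prior
lines(grid, dnorm(grid, m.n[2], sqrt(V.n[2,2])), lty=2)       # posterior, N(0,1) prior
lines(grid, dnorm(grid, 0, 1), lty=3)                         # the N(0,1) prior itself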
[Figure: densities for the slope parameter: the N(0,1) prior, the posterior under the uniform prior, and the posterior under the N(0,1) prior]
Some comments
• For a non-informative, uniform prior, the posterior is centered on the least squares estimator (this is specific to normal linear models)
• For an informative prior, the posterior mean is shrunk back towards the prior mean and the posterior variance decreases
• As sample size increases, the contribution of the prior is swamped
out by the likelihood
• Hence, as n → ∞, the posterior will be centered on the MLE
regardless of the prior & frequentist/Bayes inferences will be
similar
• However, for finite samples, there can be substantial differences
• Choosing a N(0,I) prior results in a type of shrinkage estimator
• Ever since the results of Stein (1956) and James & Stein (1961) (the MLE is inadmissible for p ≥ 3), shrinkage estimators have been very popular
• Choosing a N(0, κIp) prior for β results in a ridge regression (Hoerl and Kennard, 1970) estimator, as checked numerically in a sketch below
• Hence, priors for the regression parameters having diagonal covariance are commonly referred to as ridge regression priors.
• For a recent article on shrinkage estimators and properties, refer
to Maruyama & Strawderman (2005, Annals of Statistics 33,
1753-1770).
• This and other articles have shown excellent properties for Bayes
and empirical Bayes estimators
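• A quick numerical check of the ridge connection (hedged sketch with illustrative κ, τ, and data): with β0 = 0 and Σ0 = κIp, the posterior mean (Σ0^{−1} + τX′X)^{−1}τX′y equals the ridge estimator (X′X + λIp)^{−1}X′y with λ = 1/(τκ):

set.seed(7)
n<- 50; p<- 3
X<- matrix(rnorm(n*p), n, p)
y<- as.vector(X %*% c(1, 0, -1) + rnorm(n))
tau<- 1; kappa<- 2                                   # illustrative values
post.mean<- solve(diag(p)/kappa + tau*crossprod(X)) %*% (tau*crossprod(X, y))
lambda<- 1/(tau*kappa)
ridge<- solve(crossprod(X) + lambda*diag(p)) %*% crossprod(X, y)
cbind(post.mean, ridge)                              # the two columns agree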
• Schools of thought: subjective Bayesians argue that the prior should be chosen based on one’s own belief and state of knowledge about the unknowns
• Objective Bayesians argue that priors should be chosen (at least
in cases in which real substantive information is not available)
to yield good frequentist and Bayesian properties
• Empirical Bayes instead uses the data in estimating the prior distribution - this would be cheating for a fully Bayes procedure!