Hierarchical Models - Statistical Methods
Sarah Filippi¹
University of Oxford
Hilary Term 2016
¹With grateful acknowledgements to Neil Laws for his notes from 2012–13.
Information
Webpage: http://www.stats.ox.ac.uk/~filippi/
Lectures:
• Week 1 – Thursday 12, Friday 11
• Week 2 – Thursday 12, Friday 11
• Week 3 – Thursday 12
Practical:
• Week 3 – Friday
Outline
Hierarchical models: frequentist and Bayesian approaches.
1 Introduction of the concepts through examples
2 Inferring the parameters of a hierarchical linear model: the frequentist approach
3 Inferring the parameters of a hierarchical linear model: the Bayesian approach
4 Studying inferred Hierarchical Linear Models and cross-level interactions
5 Summary and Further properties of hierarchical models
We will rely on material that you have covered in: linear models, Bayesian statistics, R programming (e.g. lattice graphics), ...
References
• Gelman, A. and Hill, J. (2007) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
• Jackman, S. (2009) Bayesian Analysis for the Social Sciences. Wiley.
• Pinheiro, J. C. and Bates, D. M. (2000) Mixed-Effects Models in S and S-PLUS. Springer.
• Snijders, T. A. B. and Bosker, R. J. (2012) Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modelling. Second Edition. Sage.
• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer.
1. Introduction of Hierarchical Models through examples
Rail: simple example of random effects
Example from Pinheiro and Bates (2000).
• Six railway rails were chosen at random and tested three times each.
• The Rail dataset records the time it took for an ultrasonic wave to travel the length of each rail.
• This is a one-way classification: each observation is classified according to the rail on which it was made.
## if necessary first install the package MEMSS,
## either using the Packages menu,
## or alternatively using something like:
## install.packages("MEMSS")
> data(Rail, package = "MEMSS")
> Rail
Rail travel
1 A 55
2 A 53
3 A 54
4 B 26
....
17 F 85
18 F 83
Think of the measurements made on the same rail as a “group” – so we have 6 groups (labelled A, ..., F) with 3 measurements in each group.
The original experimenters were interested in
• the travel time for a typical rail (expected travel time)
• the variation in times between different rails (between-group variability)
• the variation in times for a single rail (within-group variability).
library(lattice)
myrail <- with(Rail, reorder(Rail, travel))
dotplot(myrail ~ travel, data = Rail,
xlab = "Travel time (ms)", ylab = "Rail")
[Dotplot of travel time (ms) by rail, with the rails ordered B, E, A, F, C, D.]
Between-group variability is much larger than within-group variability.
One possible model is
yij = βj + εij , i = 1, 2, 3, j = 1, . . . , 6
where
• i indexes the observations within a group
• j indexes the different groups
• yij is the travel time for observation i in group j
• the parameters β1, . . . , β6 are fixed unknown constants: βj is the effect of group j
• the εij are independent N(0, σ2) random errors.
However, this model only models the 6 specific rails that were actually measured in this experiment:
• it does not model the population of rails from which the sample was drawn
• it does not provide an estimate of the between-rail variability
• the number of parameters of the model increases linearly with the number of rails (for N rails we would have N parameters).
A hierarchical model treats the rail effects as random variations about an underlying population mean, as follows. We suppose that
yij = β + bj + εij
where
• β is a fixed unknown parameter representing the expected travel time across the whole population of rails
• bj is a random variable representing the deviation from the population mean for rail j
• εij is a random variable representing the additional deviation for the ith observation on rail j.
• We assume that, independently,
bj ∼ N(0, σ²b) and εij ∼ N(0, σ²).
So σ²b denotes the between-group variability, and σ² denotes the within-group variability.
• The parameter β is called a fixed effect, and the bj are called random effects. A model with both types of effect is often called a mixed-effects model.
• The model above is a mixed model: it has a simple fixed part (just an intercept; later examples will have more interesting fixed parts) and a random part (the bj).
The model has two sources of random variation:
• one of these sources is at the individual observation level (εij)
• the other source is at the group level (bj).
Remark: No matter how many rails we have, we will always have just 3 parameters: β, σ²b and σ².
Vocabulary
• Since we have two sources of variation, at different levels, such models are called hierarchical models and also multilevel models.
• Models with more than two levels of variation are also possible.
• Such models are also called mixed-effects models.
• Another name is variance components.
In our model yij = β + bj + εij
• observations made on different groups are independent
• but observations within the same group share the same bj and are correlated.
The variance of each yij is
var(yij) = var(bj) + var(εij) = σ²b + σ²
and the covariance between observations i and i′ (where i ≠ i′) in group j is
cov(yij, yi′j) = var(bj) = σ²b.
So the correlation between two observations from the same group is
σ²b / (σ²b + σ²).
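As a check, here is a minimal sketch (assuming the MEMSS and nlme packages are installed; the lme function itself is introduced in Section 2) that fits this model to the Rail data and computes the within-group correlation:

library(nlme)
data(Rail, package = "MEMSS")
## fit travel_ij = beta + b_j + eps_ij
rail.lme <- lme(travel ~ 1, random = ~ 1 | Rail, data = Rail)
vc <- VarCorr(rail.lme)                        # variance components
sigma2.b <- as.numeric(vc["(Intercept)", "Variance"])
sigma2   <- as.numeric(vc["Residual", "Variance"])
sigma2.b / (sigma2.b + sigma2)                 # correlation within a rail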
Summary
The rail example allowed us to
• introduce the concepts of between-group and within-group variability
• describe how to include random effects in simple constant models
• define hierarchical, multilevel, mixed-effects, ... models.
Multilevel structures can be added to any type of regression model, including linear, logistic and generalised linear models.
In this lecture, we will mainly focus on Hierarchical linear models.
Orthodont: simple linear growth curves
Example from Pinheiro and Bates (2000).
• The Orthodont dataset is a set of measurements of the distance from the pituitary gland to the pterygomaxillary fissure, taken every two years from age 8 until age 14, on a sample of 27 children – 16 males and 11 females.
• Measurements of this type are sometimes referred to as growth curve data, or repeated measures, longitudinal or panel data.
• For now, we restrict to the data from the female subjects.
• Think of the measurements made on the same subject as a group.
> data(Orthodont, package = "MEMSS")
> Orthodont$mySubj <- with(Orthodont,
reorder(Subject, distance))
> orthf <- Orthodont[Orthodont$Sex == "Female", ]
> orthf
distance age Subject Sex mySubj
65 21.0 8 F01 Female F01
66 20.0 10 F01 Female F01
67 21.5 12 F01 Female F01
68 23.0 14 F01 Female F01
....
108 28.0 14 F11 Female F11
xyplot(distance ~ age | mySubj, data = orthf,
aspect = "xy", type = c("g", "p", "l"),
scales = list(x = list(at = c(8, 10, 12, 14))))
[Trellis plot of distance against age (8–14), one panel per female subject (F10, F06, F09, F01, F05, F02, F07, F08, F03, F04, F11), with points connected by lines.]
We consider the model
yij = β1 + bj + β2xij + εij
where bj ∼ N(0, σ²b) and εij ∼ N(0, σ²).
Here
• yij is the distance for observation i on subject j
• the overall intercept for subject j is made up of a fixed part, the parameter β1, plus a random deviation bj representing the deviation from this fixed intercept for subject j
• xij is the age of the subject when the ith observation is made on subject j.
So we are regressing distance on age, and allowing a random intercept for each subject. As before, observations on different groups (i.e. subjects) are independent, but observations within the same group are dependent.
• This model assumes that the effect of age (the growth rate) is the same for all subjects.
• We can fit a model with random effects for both the intercept and the effect of age:
yij = β1 + bj + (β2 + cj)xij + εij
where bj ∼ N(0, σ²b), cj ∼ N(0, σ²c) and εij ∼ N(0, σ²).
• The random part is an intercept term plus an age term, conditional on subject. That is, we have a random growth rate, in addition to a random intercept, for each subject.
[The same trellis plot, now with three fitted lines per panel: within-subject, mixed model and population.]
Within-subject: 11 simple linear regressions, one per subject (more sensitive to extreme observations than the mixed model).
Population: one fixed model for the whole population.
Summary: Multi-level regression modelling
Example from Gelman and Hill (2007)
• Consider an educational study with data from students in many schools j = 1, . . . , J.
• We want to predict the students’ grades yij on a standardized test given their scores on a pre-test xij.
• A separate regression model can be fit within each school, and the parameters from these schools can themselves be modeled as depending on school characteristics (such as the socioeconomic status of the school’s neighborhood, whether the school is public or private, and so on).
• Alternatively, we can build a multilevel regression model.
• The student-level regression and the school-level regression here are the two levels of a multilevel model.
Models for regression coefficients
• Varying-intercept model: the regressions have the same slope in each of the schools, and only the intercepts vary.
yij = β1 + bj + β2 xij + εij
where bj ∼ N(0, σ²b) and εij ∼ N(0, σ²)
• Varying-slope model: the regressions have the same intercepts in each of the schools, and only the slopes vary.
yij = β1 + cj xij + εij
where cj ∼ N(0, σ²c) and εij ∼ N(0, σ²)
• Varying-intercept, varying-slope model: intercepts and slopes both can vary by school.
yij = β1 + bj + cj xij + εij
where bj ∼ N(0, σ²b), cj ∼ N(0, σ²c) and εij ∼ N(0, σ²)
Some other examples
From Gelman and Hill’s research (2007)
• Combining information for local decisions: home radon measurement and remediation
• Modeling correlations: forecasting presidential elections
• Small-area estimation: state-level opinions from national polls
• Social science modeling: police stops by ethnic group with variation across precincts
Examples:
macro-level (i.e. groups)    micro-level (i.e. observations)
schools                      teachers
classes                      pupils
neighborhoods                families
districts                    voters
firms                        employees
families                     children
doctors                      patients
subjects                     measurements
• So we are thinking about data with a nested structure: e.g.
  • students within schools
  • voters within districts.
• More than two levels are possible: e.g. students within classes within schools.
• Multilevel analysis is a suitable approach to take into account the social contexts as well as the individual respondents or subjects.
• The hierarchical linear model is a type of regression model for multilevel data where the dependent variable is at the lowest level.
• Explanatory variables can be defined at any level (including aggregates of micro-level variables, e.g. the average income of voters in a district).
2. Inferring the parameters of a hierarchical linear model: the frequentist approach
Three types of hierarchical linear models
• Varying-intercept model: the regressions have the same slope in each of the groups, and only the intercepts vary.
yij = β1 + bj + β2 xij + εij
• Varying-slope model: the regressions have the same intercepts in each of the groups, and only the slopes vary.
yij = β1 + (β2 + cj) xij + εij
• Varying-intercept, varying-slope model: intercepts and slopes both can vary by group.
yij = β1 + bj + (β2 + cj) xij + εij
where bj ∼ N(0, σ²b), cj ∼ N(0, σ²c) and εij ∼ N(0, σ²)
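In lme syntax (used later in this section) these three models can be written as follows; this is a sketch, where dat is a hypothetical data frame with response y, explanatory variable x and grouping factor g:

library(nlme)
lme(y ~ 1 + x, random = ~ 1 | g,     data = dat)  # varying intercept
lme(y ~ 1 + x, random = ~ 0 + x | g, data = dat)  # varying slope
lme(y ~ 1 + x, random = ~ 1 + x | g, data = dat)  # varying intercept and slope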
Matrix representation of hierarchical linear models
[Remember: we can write a linear regression model in matrix form as y = Xβ + ε.]
Denote by:
• yj the vector of observations in group j
• β the vector of fixed parameters (with components β1, β2, ...)
• bj the vector of random effects for group j (with components b1j, b2j, ...)
• εj the vector of random errors for group j.
Then, any hierarchical linear model (with one level of grouping) can be written as
yj = Xjβ + Zjbj + εj
where the matrices Xj and Zj are the known fixed-effects and random-effects regressor matrices.
The following varying-intercept model
yij = b1j + β1xij + εij
(for j = 1, . . . , J and i = 1, . . . , nj) can be written in matrix form:
yj = (y1j, y2j, . . . , ynjj)ᵀ = (1, 1, . . . , 1)ᵀ b1j + (x1j, x2j, . . . , xnjj)ᵀ β1 + (ε1j, ε2j, . . . , εnjj)ᵀ = Xjβ + Zjbj + εj
where bj = b1j, β = β1, Xj = (x1j, x2j, . . . , xnjj)ᵀ and Zj = (1, 1, . . . , 1)ᵀ.
Similarly, the varying-intercept varying-slope model
yij = β1 + β2xij + b1j + b2jxij + εij
(for j = 1, . . . , J and i = 1, . . . , nj) can be written in matrix form:
yj = (y1j, . . . , ynjj)ᵀ = Xj (β1, β2)ᵀ + Zj (b1j, b2j)ᵀ + (ε1j, . . . , εnjj)ᵀ = Xjβ + Zjbj + εj
Here Xj = Zj is the nj × 2 matrix whose ith row is (1, xij).
Some dimensions
• Let the groups be j = 1, . . . , J, and assume we have nj observations in group j. We can take
yj = Xjβ + Zjbj + εj,  j = 1, . . . , J
as a general form of the hierarchical linear model, provided that we allow β and bj to have any dimensions.
• Let β contain p parameters, and let each bj be a random vector of length q.
• The matrices Xj and Zj are known fixed-effects and random-effects regressor matrices:
  • the columns of Xj are the values of the explanatory variables for group j (Xj is nj × p)
  • the columns of Zj are usually a subset of the columns of Xj (Zj is nj × q).
Typically:
• We assume that the components of the vector εj are i.i.d.:
εj ∼ N(0, σ²I).
• The vectors bj are assumed to be independent and normally distributed with mean 0:
bj ∼ N(0, Ψ)
• However, we do not assume that the components of bj are independent – we make no assumptions about the correlations of the components of bj (these are estimated).
• bj and εj are mutually independent and independent across groups.
We aim to estimate the parameters β, σ² and Ψ.
Here we focus on a model with a “single level of grouping”. See Pinheiro and Bates (2000) for details and the more general case.
Likelihood function
The likelihood function is the probability density of the data given the parameters, regarded as a function of the parameters:
L(β, σ², Ψ; y) = p(y | β, σ², Ψ).
Because the random-effect variables bj, j = 1, . . . , J, are part of the model, we must integrate them out:
L(β, σ², Ψ; y) = ∏_{j=1}^{J} p(yj | β, σ², Ψ) = ∏_{j=1}^{J} ∫ p(yj | bj, β, σ², Ψ) p(bj | Ψ) dbj
Which assumption have we used here?
Given that
yj | bj, β, σ² ∼ N(Xjβ + Zjbj, σ²I)
and
bj ∼ N(0, Ψ)
we obtain
L(β, σ², Ψ; y) = ∏_{j=1}^{J} ∫ (2πσ²)^{−nj/2} exp(−‖yj − Xjβ − Zjbj‖² / (2σ²)) · (2π)^{−q/2} |Ψ|^{−1/2} exp(−bjᵀ Ψ⁻¹ bj / 2) dbj
Maximum Likelihood (ML)
Maximise the likelihood function by:
1 maximising it with respect to β and σ² (similar to the linear regression model),
2 replacing β and σ² by their estimates in the likelihood function,
3 maximising the remaining objective function (called the profile likelihood) with respect to Ψ.
Remarks:
• Maximum likelihood estimates of variance components are biased and tend to underestimate these parameters.
• If you are interested in estimating components of variance then it is better to use another method (see next slide).
Restricted Maximum Likelihood (REML)
• Also called “Reduced” or “Residual” Maximum Likelihood.
• Maximise the following quantity:
LR(Ψ, σ²; y) = ∫ L(β, Ψ, σ²; y) dβ
1 by treating σ² as if known and maximising the objective function with respect to Ψ:
Ψ̂σ = argmaxΨ LR(Ψ, σ²; y)
2 then using this estimate of Ψ to estimate σ²:
σ̂ = argmaxσ LR(Ψ̂σ, σ²; y)
The estimates Ψ̂σ̂ and σ̂ jointly maximise LR(Ψ, σ²; y).
3 β is then estimated by finding its expected value with respect to the posterior distribution.
ML vs REML
• REML gives less biased estimates of the random-effects variances than ML.
• The REML estimates of Ψ and σ² are invariant to the value of β and less sensitive to outliers in the data compared to ML estimates.
• If you use REML to estimate the parameters, you can only compare two models that have identical fixed-effects design matrices and are nested in their random-effects terms.
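As a small illustration (a sketch, assuming the MEMSS and nlme packages are installed), the same model can be fitted by both methods and the variance components compared:

library(nlme)
data(Orthodont, package = "MEMSS")
orthf <- Orthodont[Orthodont$Sex == "Female", ]
fit.ml   <- lme(distance ~ 1 + age, random = ~ 1 | Subject,
                data = orthf, method = "ML")
fit.reml <- lme(distance ~ 1 + age, random = ~ 1 | Subject,
                data = orthf, method = "REML")   # REML is the default
VarCorr(fit.ml)    # ML variance estimates
VarCorr(fit.reml)  # REML estimates are typically larger (less biased)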
Hierarchical models in R
We could use:
• nlme – “Fit and compare Gaussian linear and nonlinear mixed-effects models.”
A recommended package for R, stable and well-established, documented in Pinheiro and Bates (2000).
Choose method = "ML" or method = "REML".
• lme4 – “Fit linear and generalized linear mixed-effects models.”
More efficient than nlme, uses a slightly different syntax, still being developed.
We use nlme. (However we will not use groupedData objects.)
(If you do try lme4, avoid using both nlme and lme4 within the same R session.)
The orthodont example
We consider the model
yij = β1 + bj + β2xij + εij
where bj ∼ N(0, σ²b) and εij ∼ N(0, σ²).
Here
• yij is the distance for observation i on subject j
• xij is the age of the subject when the ith observation is made on subject j
• we are regressing distance on age, and allowing a random intercept for each subject (we focus on the female subjects).
We can fit the model using the lme function as follows:
orthf1.lme <- lme(distance ~ 1 + age,
random = ~ 1 | mySubj, data = orthf)
> summary(orthf1.lme)
....
Random effects:
Formula: ~1 | mySubj
(Intercept) Residual
StdDev: 2.0685 0.78003
Fixed effects: distance ~ 1 + age
Value Std.Error DF t-value p-value
(Intercept) 17.3727 0.85874 32 20.2304 0
age 0.4795 0.05259 32 9.1186 0
....
The parameter estimates are:
σ̂b = 2.0685, σ̂ = 0.78003, β̂1 = 17.3727, β̂2 = 0.4795.
The estimated distribution of the intercept is N(β̂1, σ̂²b).
The estimated distribution of the residual is N(0, σ̂²).
Likelihood and Fixed Effects Estimates
We obtain the likelihood as follows:
> logLik(orthf1.lme)
’log Lik.’ -70.609 (df=4)
For orthf1.lme, the estimated fixed effects β1 and β2 are
> fixef(orthf1.lme)
(Intercept) age
17.37273 0.47955
The estimated average distance as a function of age x is β̂1 + β̂2x.
Prediction of random effects
The random-effects “estimates” b̂j are
> ranef(orthf1.lme)
(Intercept)
F10 -4.005329
F06 -1.470449
....
F11 3.599309
These “estimates” b̂j are in fact best linear unbiased predictors (BLUPs) of the random effects, also called the conditional modes (or conditional means). The random effects bj are actually random variables rather than unknown parameters, and normally we speak of predicting (rather than estimating) random variables.
These “estimates” measure the deviation from the average intercept for each group. The estimated average distance in group j as a function of age x is β̂1 + b̂j + β̂2x.
To get the coefficients for each subject, i.e. β̂1 + b̂j and β̂2, we can use
> coef(orthf1.lme)
(Intercept) age
F10 13.367 0.47955
F06 15.902 0.47955
....
F11 20.972 0.47955
Again, it is very important to keep in mind that these coefficients are predictions and not estimates.
Tests of random effects
Now consider the model with an additional random effect in the slope:
yij = β1 + b1j + (β2 + b2j)xij + εij
where yij is the distance for observation i on subject j.
We fit this model in R as follows:
orthf2.lme <- lme(distance ~ 1 + age,
random = ~ 1 + age | mySubj, data = orthf)
What are the parameters of this model?
• two fixed effects β1, β2
• the variance of εij, denoted σ²
• the variances of b1j and b2j, denoted respectively σ²1 and σ²2
• and the covariance of b1j and b2j, denoted σ12.
Tests of random effects
Compared with the more complex model orthf2.lme, is the simpler model orthf1.lme adequate?
• These two models are nested: orthf1.lme is a special case of orthf2.lme.
• We can use the likelihood ratio test to compare nested models.
If L2 is the maximised likelihood of the more general model (e.g. orthf2.lme) and L1 is the maximised likelihood of the simpler model (e.g. orthf1.lme), then the test statistic is
2 log(L2/L1) = 2[log L2 − log L1].
Let ki be the number of parameters of model i (e.g. k1 = 4, k2 = 6). Then under the null hypothesis that the simpler model is true, the large-sample distribution of this test statistic is approximately χ² with k2 − k1 degrees of freedom.
We can perform this test using the anova function:
> anova(orthf1.lme, orthf2.lme)
df AIC BIC logLik Test L.Ratio p-value
orthf1.lme 4 149 156 -70.609
orthf2.lme 6 149 159 -68.714 1 vs 2 3.78 0.1503
p-value = P(χ²_{k2−k1} ≥ 2 log(L2/L1))
So we do not have enough evidence to reject the null hypothesis that the simpler model is adequate.
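As a check, the L.Ratio and p-value in the anova table above can be reproduced by hand:

lr <- 2 * (-68.714 - (-70.609))             # = 3.79, the L.Ratio above
pchisq(lr, df = 6 - 4, lower.tail = FALSE)  # p-value, approximately 0.15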
Conditions to use Likelihood ratio tests (LRT)
• one model must be nested within the other
• the fitting method, i.e. REML or ML, must be the same for both models (and if using REML the fixed parts of the two models must be the same – otherwise fit using ML)
• Under the null hypothesis the parameter σ²2 is set to zero, which is on the boundary of the set of possible values for σ²2. A consequence is that, in addition to being approximate (since based on an asymptotic result), p-values such as the one above tend to be conservative – i.e. the p-value above may be larger than the true p-value.
• It is possible to check the distribution of the likelihood ratio test statistic under the null hypothesis through simulation, as sketched below.
See Pinheiro and Bates (2000), Section 2.4.1, and Snijders and Bosker (2012), Section 6.2.1.
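A sketch of such a simulation check, using the simulate method for lme fits described in Pinheiro and Bates (2000), Section 2.4.1 (this can be slow; nsim here is kept modest):

orthf.sim <- simulate(orthf1.lme, m2 = orthf2.lme, nsim = 200)
plot(orthf.sim)  # compare empirical and nominal null distributions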
We can also compare against the model with no random effects:
yij = β1 + bj + β2xij + εij
where bj ∼ N(0, σ²b) and εij ∼ N(0, σ²), vs.
yij = β1 + β2xij + εij.
> orthf.lm <- lm(distance ~ 1 + age, data = orthf)
> orthf1ML.lme <- lme(distance ~ 1 + age,
random = ~ 1 | mySubj,
data = orthf, method = "ML")
> anova(orthf1ML.lme, orthf.lm)
Model df AIC BIC logLik Test L.Ratio p-value
orthf1ML.lme 1 4 146.03 153.17 -69.0
orthf.lm 2 3 196.76 202.11 -95.4 1 vs 2 52.725 <.0001
The random effects are an important part of our model.
Tests of fixed effects
We can test the effect of a single fixed parameter using an approximate test for which the test statistic is
β̂k / s.e.(β̂k).
> summary(orthf1.lme)
....
Fixed effects: distance ~ 1 + age
Value Std.Error DF t-value p-value
(Intercept) 17.3727 0.85874 32 20.2304 0
age 0.4795 0.05259 32 9.1186 0
....
Both fixed-effects terms are clearly significant.
Likelihood ratio tests of fixed effects are also possible – estimation must be via maximum likelihood. We now use the full Orthodont dataset from Pinheiro and Bates (2000).
> orth1ML.lme <- lme(distance ~ 1 + age,
random = ~ 1 + age | Subject,
data = Orthodont, method = "ML")
> orth2ML.lme <- lme(distance ~ 1 + age*Sex,
random = ~ 1 + age | Subject,
data = Orthodont, method = "ML")
> anova(orth1ML.lme, orth2ML.lme)
Model df AIC BIC logLik Test L.Ratio p-value
orth1ML.lme 1 6 451.21 467.30 -219.61
orth2ML.lme 2 8 443.81 465.26 -213.90 1 vs 2 11.406 0.0033
The p-value is small, so we prefer orth2ML.lme. (Pinheiro and Bates (2000), Section 2.4.2, are very cautious about LRTs of this type: they show that such tests can be “anti-conservative”, giving p-values that are too small.)
Confidence intervals
• Confidence intervals on the variance-covariance components and the fixed effects can be obtained using approximate distributions for the maximum likelihood and REML estimates
• See Pinheiro and Bates (2000), Section 2.4.3
• Use function intervals in R.
Confidence intervals
Recall that β̂1 = 17.3727, β̂2 = 0.4795, σ̂b = 2.0685, σ̂ = 0.78003.
> intervals(orthf1.lme)
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
(Intercept) 15.62353 17.37273 19.12193
age 0.37242 0.47955 0.58667
attr(,"label")
[1] "Fixed effects:"
Random Effects:
Level: mySubj
lower est. upper
sd((Intercept)) 1.3138 2.0685 3.2567
Within-group standard error:
lower est. upper
0.61052 0.78003 0.99662
Examining residuals
See Pinheiro and Bates (2000), Section 4.3. Our models assume
• within-group errors are i.i.d. and normally distributed, with mean zero and variance σ²
• the random effects are normally distributed with mean zero and some covariance matrix.
Residuals and plots
Consider the model
orthf1.lme <- lme(distance ~ 1 + age,
random = ~ 1 | mySubj, data = orthf)
yij = β1 + bj + β2xij + εij
The within-group residuals are defined by
eij = yij − (β̂1 + b̂j + β̂2xij)
and can be obtained by resid(orthf1.lme). These are unstandardised. To get standardised residuals, use
resid(orthf1.lme, type = "pearson")
The type argument can be abbreviated to type = "p".
## resid(.) means the residuals of orthf1.lme
plot(orthf1.lme, mySubj ~ resid(.), abline = 0, ylab = "Subject")
[Plot of the within-group residuals by subject (F10, ..., F11), with a vertical reference line at 0.]
qqnorm(orthf1.lme, ~ resid(., type = "p"))
[Normal Q-Q plot of the standardised residuals against quantiles of the standard normal.]
Further examples to try (see Pinheiro and Bates (2000), Section 4.3).
orth2.lme <- lme(distance ~ 1 + age*Sex,
random = ~ 1 + age | Subject,
data = Orthodont)
plot(orth2.lme, Subject ~ resid(.), abline = 0)
plot(orth2.lme, resid(., type = "p") ~ fitted(.) | Sex,
id = 0.05, adj = -0.3)
qqnorm(orth2.lme, ~ resid(., type = "p") | Sex)
qqnorm(orth2.lme, ~ ranef(.), id = 0.1)
pairs(orth2.lme, ~ ranef(.) | Sex,
id = ~ Subject == "M13", adj = -0.3)
Also see Pinheiro and Bates (2000), Section 4.3, for a model where the within-subject variance differs between males and females.
3. Inferring the parameters of a hierarchical linear model: a Bayesian approach
Bayesian models
• In our models so far, the unknown parameters β1, β2, σ², σ²b, . . . have been treated as fixed unknown constants – the classical/frequentist approach.
• Recall that in Bayesian statistics we represent our uncertainty about a parameter θ (which in general is a vector) by giving θ a probability distribution (i.e. we treat θ as a random variable).
Bayesian models
We have
• a model p(y | θ) for the data, conditional on the parameter being θ
• and a prior distribution p(θ) for the parameter θ.
Then the posterior distribution p(θ | y) for θ given y is
p(θ | y) ∝ p(y | θ) p(θ)
posterior ∝ likelihood × prior.
[Illustration: likelihood × prior = posterior, each shown as a curve over θ ∈ (0, 100).]
Bayesian hierarchical models
The form of a Bayesian hierarchical model is as follows.
We have
• a model p(yij | θj) for the data in group j (for j = 1, . . . , J)
• a between-group model or “prior” p(θj | φ) for the parameter θj (for j = 1, . . . , J)
• and a prior p(φ) for the hyperparameter φ.
[Graphical model: φ → θj → yij.]
Example: schools
• Consider an educational study with data from students in many schools.
• We want to predict the students’ grades y on a standardized test given their scores on a pre-test x:
y ∼ N(b1 + b2x, σ²)
• The regression model can depend on the school:
yij ∼ N(bj1 + bj2xij, σ²)
where (bj1, bj2)ᵀ ∼ N((β1, β2)ᵀ, Ψ)
Example: schools
We consider yij ∼ N(bj1 + bj2xij, σ²) where (bj1, bj2)ᵀ ∼ N((β1, β2)ᵀ, Ψ).
[Graphical model: β1, β2 and Ψ determine (bj1, bj2); together with σ² and xij these determine yij.]
Bayesian parameter inference
• In Bayesian statistics, the information about a parameter φ is summarised in the posterior distribution.
• A parameter value with a high posterior probability explains the data better than a parameter value with a lower posterior probability.
• The posterior probability distribution is typically not available in closed form.
• There is therefore a need to use computer simulation and Monte Carlo techniques to sample from the posterior distribution.
Markov chain Monte Carlo algorithm
Idea: construct a Markov chain with stationary distribution p(φ | y).
Metropolis-Hastings algorithm:
• start from an arbitrary point φ⁽⁰⁾
• for t = 1, 2, . . .
  • generate φ′ ∼ q(φ | φ⁽ᵗ⁾)
  • take φ⁽ᵗ⁺¹⁾ = φ′ with probability
    p = min(1, [p(y | φ′) p(φ′) q(φ⁽ᵗ⁾ | φ′)] / [p(y | φ⁽ᵗ⁾) p(φ⁽ᵗ⁾) q(φ′ | φ⁽ᵗ⁾)])
    and φ⁽ᵗ⁺¹⁾ = φ⁽ᵗ⁾ otherwise (with probability 1 − p).
[Illustration: a chain started at φ⁽⁰⁾ exploring the posterior.]
MCMC and Gibbs Sampler
• We generate samples from the posterior distribution of the model parameters using MCMC (Markov chain Monte Carlo) simulation. Much more on MCMC in Further Statistical Methods.
• To do so, we need to be able to evaluate the likelihood p(y | φ) for each value of φ.
• However, this requires solving the following integral:
p(y | φ) = ∫ p(y | θ) p(θ | φ) dθ
• The Gibbs sampler algorithm can be used in such a case to approximate the posterior distribution.
[Graphical model: φ → θj → yj.]
In R: rjags, JAGS
• We use rjags, which depends on coda. Installing rjags will normally install coda as well – use the Packages menu, or alternatively something like install.packages("rjags").
• You also need to install JAGS (Just Another Gibbs Sampler), which you can get from
http://sourceforge.net/projects/mcmc-jags/files/JAGS/3.x/
Radon data
Example from Gelman and Hill (2007).
• The entire dataset consists of radon measurements from over 80,000 houses spread across approximately 3,000 counties in the USA. As in Gelman and Hill (2007), Chapter 12, we will restrict to estimating the distribution of radon levels within the 85 counties in Minnesota (919 measurements).
• Think of the measurements made in the same county as a group.
### radon_setup.R from
### http://www.stat.columbia.edu/~gelman/arm/examples/radon/
## a copy is available on the HMs webpage
source("radon_setup.R")
> radondata <- data.frame(y = y, x = x, county = county)
> radondata
y x county
1 0.78846 1 1
2 0.78846 0 1
3 1.06471 0 1
4 0.00000 0 1
5 1.13140 0 2
....
917 1.60944 0 84
918 1.30833 0 85
919 1.06471 0 85
• y is the log of the radon measurement in a house
• x is the floor of this measurement (x = 0 for basement, x = 1 for first floor)
• county is the county in which the measurement was taken.
At first we ignore x.
Models
Let the observations be indexed by i = 1, . . . , n (where n = 919) and let the counties be indexed by j = 1, . . . , J (where J = 85). Let j[i] be shorthand for county[i], so
• j[1] = · · · = j[4] = 1, j[5] = 2, . . .
• . . . , j[917] = 84, j[918] = j[919] = 85.
Consider the model
yi = βj[i] + εi
where βj ∼ N(µ, σ²b) and εi ∼ N(0, σ²). This is the same model as for the Rail data, just written in a slightly different way.
We have
• yi ∼ N(βj[i], σ²)
• where βj ∼ N(µ, σ²b)
• and where µ, σ² and σ²b are fixed unknown parameters.
Following a Bayesian approach:
• We treat the unknown parameters as random variables.
• After defining the likelihood and the priors for the parameters, we use an MCMC algorithm (via JAGS) to sample from the appropriate posterior distribution.
• MCMC algorithms generate dependent samples from the posterior distribution, using the likelihood and priors.
The first step is to tell JAGS what the model is, using a model file. We create a text file called radon1.jags containing the following.
model {
for(i in 1:n) {
y[i] ~ dnorm(beta[county[i]], sigma^(-2))
}
for(j in 1:J) {
beta[j] ~ dnorm(mu, sigma.b^(-2))
}
mu ~ dnorm(0, 0.0001)
sigma ~ dunif(0, 100)
sigma.b ~ dunif(0, 100)
}
• Note: JAGS uses the precision, not the variance, when specifying a normal distribution (precision = 1/variance). So the precision of yi being σ⁻² corresponds to a variance of σ².
• The prior for µ is a normal with mean 0 and precision 0.0001, i.e. a variance of 10,000. The priors for σ and σb are both Uniform(0, 100).
• R needs to be able to find radon1.jags. Make sure you start R in a folder where you put this file.
Continue setup.
library(rjags)
r1.inits <- list(beta = rnorm(J), mu = rnorm(1),
sigma = runif(1), sigma.b = runif(1))
r1.jags <- jags.model("radon1.jags", inits = r1.inits)
## the following also works
## r1.jags <- jags.model("radon1.jags")
Specify the variables to monitor and then run.
r1.vars <- c("beta", "mu", "sigma", "sigma.b")
r1.sim <- coda.samples(r1.jags, r1.vars, n.iter = 10000)
> summary(r1.sim)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
beta[1] 1.0595 0.25250 0.0025250 0.0022662
beta[2] 0.8891 0.10676 0.0010676 0.0012457
beta[3] 1.2298 0.26478 0.0026478 0.0025880
....
beta[85] 1.2828 0.27726 0.0027726 0.0027516
mu 1.3118 0.04996 0.0004996 0.0007692
sigma 0.7994 0.01953 0.0001953 0.0003554
sigma.b 0.3162 0.04904 0.0004904 0.0016397
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
beta[1] 0.5471 0.8922 1.0638 1.2306 1.5466
beta[2] 0.6813 0.8179 0.8890 0.9596 1.0981
beta[3] 0.6985 1.0596 1.2308 1.4071 1.7483
....
beta[85] 0.7448 1.0995 1.2792 1.4662 1.8331
mu 1.2137 1.2783 1.3119 1.3446 1.4117
sigma 0.7618 0.7860 0.7992 0.8123 0.8388
sigma.b 0.2245 0.2817 0.3149 0.3487 0.4146
Using summary(r1.sim[, c(1:3, 85:88)]) picks out just the parameters we have looked at above.
These results are similar to those obtained from the classical approach using
r1.lme <- lme(y ~ 1, random = ~ 1 | county)
xyplot(r1.sim[, c(1, 86:88)])
In these trace plots we are looking to see if the Markov chain has mixed well, i.e. to see if it appears that it has (approximately) reached its stationary distribution.
[Trace plots of the MCMC samples for beta[1], mu, sigma and sigma.b against iteration number.]
densityplot(r1.sim[, c(1:3, 84:88)])
[Posterior density plots for beta[1], beta[2], beta[3], beta[84], beta[85], mu, sigma and sigma.b.]
95% HPD intervals.
> HPDinterval(r1.sim)
[[1]]
lower upper
beta[1] 0.5712122 1.5625728
beta[2] 0.6789466 1.0955027
beta[3] 0.6786186 1.7246411
....
beta[84] 1.1372023 1.8465939
beta[85] 0.7216640 1.8096207
mu 1.2128849 1.4098202
sigma 0.7595627 0.8359080
sigma.b 0.2271229 0.4163161
....
Suppose we now want to incorporate the effect of the floor x on which the measurement was taken (x = 0 for basement, x = 1 for first floor). We make our Bayesian model
• yi ∼ N(βj[i] + αxi, σ²)
• where βj ∼ N(µ, σ²b)
• and where µ, α, σ² and σ²b have prior distributions.
[Our notation differs from that of Gelman and Hill (2007): where we have βj they use αj, and where we have α they use β.]
We create a second file called radon2.jags as follows.
model {
for(i in 1:n) {
y[i] ~ dnorm(beta[county[i]] + alpha*x[i], sigma^(-2))
}
for(j in 1:J) {
beta[j] ~ dnorm(mu, sigma.b^(-2))
}
mu ~ dnorm(0, 0.0001)
alpha ~ dnorm (0, 0.0001)
sigma ~ dunif(0, 100)
sigma.b ~ dunif(0, 100)
}
r2.inits <-
list(beta = rnorm(J), alpha = rnorm(1), mu = rnorm(1),
sigma = runif(1), sigma.b = runif(1))
r2.jags <- jags.model("radon2.jags", inits = r2.inits)
r2.vars <- c("beta", "alpha", "mu", "sigma", "sigma.b")
r2.sim <- coda.samples(r2.jags, r2.vars, n.iter = 10000)
> summary(r2.sim[, c(1:3, 86:89)])
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
alpha -0.6932 0.07158 0.0007158 0.0011100
beta[1] 1.1869 0.25584 0.0025584 0.0033428
beta[2] 0.9297 0.10115 0.0010115 0.0012007
beta[85] 1.3823 0.28355 0.0028355 0.0028297
mu 1.4607 0.05346 0.0005346 0.0009401
sigma 0.7568 0.01816 0.0001816 0.0002934
sigma.b 0.3358 0.04695 0.0004695 0.0012546
....
We can compare the Bayesian results above with the following classical model.
> r2.lme <- lme(y ~ 1 + x, random = ~ 1 | county)
> summary(r2.lme)
....
Random effects:
Formula: ~1 | county
(Intercept) Residual
StdDev: 0.3282224 0.7555892
Fixed effects: y ~ 1 + x
Value Std.Error DF t-value p-value
(Intercept) 1.4615979 0.05157622 833 28.338600 0
x -0.6929937 0.07043081 833 -9.839354 0
....
• The estimates and standard errors for the classical and Bayesian models are similar: for both models the coefficient of x is estimated as −0.7. Further, a 95% confidence interval, and also a 95% credible interval, are both estimated as approximately (−0.83, −0.55).
• That is, the floor on which the radon measurement is made has a significant effect on the measurement: in comparison with basement measurements, a measurement on the first floor is lower (on average, lower by 0.7).
Gelman and Hill (2007) avoid the term “random effects”. They prefer the term “modelled” for coefficients that are modelled using a probability distribution (i.e. what we have been calling random effects), and “unmodelled” otherwise (i.e. what we have been calling fixed effects). Their recommended strategy for multilevel modelling is:
• start by fitting classical regressions (e.g. using lm), . . .
• then set up multilevel models and fit using lme, . . .
• then fit fully Bayesian multilevel models, . . .
• . . .
Getting as far as fitting Bayesian models means treating all coefficients as being “modelled”.
4. Studying inferred Hierarchical Linear Models and cross-level interactions
• Between and within group variability
• Prediction
• Cross-level interactions
Hierarchical Linear Models
A varying-intercept, varying-slope linear model (with one explanatory variable) can be written
yij = β1 + bj + (β2 + cj) xij + εij
where
• yij is the i-th observation in the j-th group
• xij is the i-th explanatory variable in the j-th group
• β1, β2 are fixed effects
• (bj , cj) ∼ N(0,Ψ) are the random effects
• εij ∼ N(0, σ2)
• (bj, cj) and εij are mutually independent and independent across groups
More general formula
Denote by:
• yj the vector of observations in group j
• β the vector of fixed parameters (with components β1, β2, ...)
• bj the vector of random effects for group j (with components b1j, b2j, ...)
• εj the vector of random errors for group j.
Then, any hierarchical linear model (with one level of grouping) can be written as
yj = Xjβ + Zjbj + εj
where the matrices Xj and Zj are the known fixed-effects and random-effects regressor matrices.
• The components of the vector εj are i.i.d.:
εj ∼ N(0, σ²I).
• The vectors bj are assumed to be independent and normally distributed with mean 0:
bj ∼ N(0, Ψ)
• bj and εj are mutually independent and independent across groups.
Both frequentist and Bayesian approaches enable us to
• infer the parameters of the model
• obtain confidence intervals/regions for each parameter
Example: language scores in elementary schools
Example from Snijders and Bosker (2012).
Data on 3758 students, age about 11 years, from 211 schools in the Netherlands. The structure of the data is students within classes, where class sizes range from 4 to 34. The dataset contains one class per school, so the class and school level are the same.
Let y be a student’s score on a language test, and let IQ be a measure of their verbal IQ. We consider the model
yij = β0 + β1IQij + bj + εij
where j indexes schools, and i indexes the students within a school. Here β0, β1 are fixed parameters and, independently, bj ∼ N(0, σ²b) and εij ∼ N(0, σ²).
The IQ score has been centered so that its mean is 0 (approx).
Centering the data
• It is important to centre the explanatory variables.
• If we are interested in the deviation from the mean, it may be useful to add the group mean as a variable.
• This point, and its link to the between-group and within-group regression coefficients, will be covered in the last lecture.
In R
> nlschools <- read.table("mlbook2_r.dat", header = TRUE)
> names(nlschools)[c(3, 4, 5, 9, 10)] <-
c("langscore", "SES", "IQ", "SESsch.av", "IQsch.av")
> lang.lme <- lme(langscore ~ 1 + IQ, random = ~ 1 | schoolnr,
data = nlschools, method = "ML")
> summary(lang.lme)
....
Random effects:
Formula: ~1 | schoolnr
(Intercept) Residual
StdDev: 3.1377 6.3615
Fixed effects: langscore ~ 1 + IQ
Value Std.Error DF t-value p-value
(Intercept) 41.055 0.243451 3546 168.637 0
IQ 2.507 0.054396 3546 46.096 0
....
• The fitted regression line for school j is
yij = 41.06 + b̂j + 2.507 IQij.
• bj ∼ N(0, σ²b) are the school-dependent deviations, with σ̂b = 3.14.
• The scatter around the regression line for a particular student has standard deviation σ̂ = 6.36.
Variance and covariance
The covariance between two students in the same class j is
Cov(yij, yi′j) = σ²b = 3.14²
and the variance of the score for one student is
Var(yij) = σ²b + σ² = 3.14² + 6.36²
In this varying-intercept, fixed-slope model, the covariance and variance do not depend on IQ.
Covariance for varying-intercept, varying-slope model
Consider a varying-intercept, varying-slope model:
yij = β1 + bj + (β2 + cj)xij + εij .
The covariance between two individuals in the same group j is
Cov(yij, yi′j) = Cov(β1 + bj + (β2 + cj)xij + εij, β1 + bj + (β2 + cj)xi′j + εi′j)
= Cov(bj + cjxij + εij, bj + cjxi′j + εi′j)
= Ψ²11 + Ψ12(xij + xi′j) + Ψ²22 xij xi′j
and the variance of one individual is
Var(yij) = Ψ²11 + 2Ψ12xij + Ψ²22 x²ij + σ².
Exercise: derive the covariance between two individuals in the same group for a model with more than one explanatory variable.
Prediction in Hierarchical Models
Prediction in hierarchical models is more complicated than in classical regression, as we have to decide:
• do we want the predicted value for an existing group?
• or do we want the fitted/predicted value for a new group?
In the school example:
• do we want to predict the score of a new student in an existing school?
• or do we want to predict the score of a new student in a new school?
Prediction in classical regression
In classical regression, e.g., linear regression
Y = Xβ + ε .
prediction is simple:
1 specify the predictor matrix X for a new set of observations
2 compute the linear predictor Xβ̂
3 simulate the predictive data by simulating independent normal errors ε with mean 0 and standard deviation σ̂, and compute y = Xβ̂ + ε.
Prediction for a new observation in an existing group
Consider a varying-intercept, varying-slope model:
yij = β1 + bj + (β2 + cj)xij + εij .
If we want to predict the value for a new individual in an existing group, we need to take into account the estimated (or predicted) values b̂j and ĉj of the random effects.
Let x be the explanatory variable for the new individual in group j; then the value for this new individual can be simulated from a normal distribution as follows:
y ∼ N(β̂1 + b̂j + (β̂2 + ĉj)x, σ̂²)
Prediction for a new observation in a new group
Consider a varying-intercept, varying-slope model:
yij = β1 + bj + (β2 + cj)xij + εij .
If we want to predict the value for a new individual in a new group, the random effects should be set equal to zero.
Let x be the explanatory variable for the new individual in a new group; then the value for this new individual can be simulated from a normal distribution as follows:
y ∼ N(β̂1 + β̂2x, σ̂²)
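A sketch of these two simulations in R, given a fitted varying-intercept, varying-slope lme model; the object fit, the group label "F02" and the value x = 16 are hypothetical:

library(nlme)
beta <- fixef(fit)                  # estimated fixed effects
bj   <- unlist(ranef(fit)["F02", ]) # predicted random effects for the group
s    <- fit$sigma                   # estimated within-group sd
x    <- 16
rnorm(1, (beta[1] + bj[1]) + (beta[2] + bj[2]) * x, s)  # existing group
rnorm(1, beta[1] + beta[2] * x, s)                      # new group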
Orthodont: simple linear growth curves
This is an example of growth curve data, or repeated measures or longitudinal or panel data.
Here we think of the measurements made on the same subject as a group.
For each individual j,
• xij is the age of the child at time i.
• yij is the measurement of the distance from the pituitary gland to the pterygomaxillary fissure of the child at time i.
We consider a varying-intercept, fixed-slope model
yij = β1 + bj + β2xij + εij
Fitted values for an existing group: the random effects should take their estimated values, so we want β̂1 + b̂j + β̂2xij, obtained by
> fitted(orthf1.lme, level = 1)
F01 F01 F01 F01 F02 F02 F02 F02
19.980 20.939 21.898 22.857 21.549 22.508 23.467 24.427
....
Fitted/predicted values for a new group: the random effects should be set to zero, so we want β̂1 + β̂2xij, obtained by
> fitted(orthf1.lme, level = 0)
F01 F01 F01 F01 F02 F02 F02 F02
21.209 22.168 23.127 24.086 21.209 22.168 23.127 24.086
....
These are for the ages at which we observed F01, F02, ....
fitted(orthf1.lme, level = 0:1) gives both types.
Predicted values at other ages:
> neworthf <-
data.frame(mySubj = rep(c("F02", "F10"), each = 3),
age = rep(16:18, 2))
> neworthf
mySubj age
1 F02 16
2 F02 17
3 F02 18
4 F10 16
5 F10 17
6 F10 18
> predict(orthf1.lme, newdata = neworthf, level = 0:1)
mySubj predict.fixed predict.mySubj
1 F02 25.045 25.386
2 F02 25.525 25.865
3 F02 26.005 26.345
4 F10 25.045 21.040
5 F10 25.525 21.520
6 F10 26.005 21.999
Cross-level interaction effects
It is possible to add explanatory variables for each group. For example, we can consider a model of the form
yij = γ0j + γ1jxij + εij
where
γ0j = β00 + β01zj + b0j
γ1j = β10 + β11zj + b1j.
Here xij is an individual-level explanatory variable, and zj is a group-level explanatory variable. Substituting, we get
yij = β00 + β01zj + β10xij + β11zjxij + b0j + b1jxij + εij.
The term β11 zj xij is called the cross-level interaction effect.
Cross-level interaction effects
In this example
yij = β00 + β01zj + β10xij + β11zjxij + b0j + b1jxij + εij
we have fixed effects for x, z and their interaction, and (within each group) a random intercept and a random slope for x.
Other combinations of fixed and random effects are possible. For example, we could have a random effect for z and for the interaction between x and z.
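A sketch of the first combination in lme syntax; the data frame dat, with columns y, x, z and grouping factor g, is hypothetical:

library(nlme)
lme(y ~ 1 + x + z + x:z,   # fixed effects for x, z and their interaction
    random = ~ 1 + x | g,  # random intercept and random slope for x
    data = dat)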
Cross-level interaction effects: Orthodont example
So far we have only focused on female patients.
We want to include sex as an additional variable: for each individual j, zj denotes the gender of the child (male = 0, female = 1).
We extend the previous varying-intercept, fixed-slope model
yij = β1 + bj + β2xij + εij
by making the intercept and the slope depend on the sex of the individual, and adding a random effect in the slope. We have:
yij = β1 + b1j + β2zj + (β3 + b2j + β4zj)xij + εij
Given this model
yij = β1 + b1j + β2zj + (β3 + b2j + β4zj)xij + εij
we can make the following comments.
• The average distance for a boy of age a is β1 + β3a.
• The average distance for a girl of age a is β1 + β2 + (β3 + β4)a.
• The growth rate for a girl is distributed N(β3 + β4, σ²2).
• The distance distribution for a girl at age a is normal with mean β1 + β2 + (β3 + β4)a and variance σ²1 + 2σ12a + a²σ²2 + σ².
A made-up medical study example
In a medical study, each patient receives one of two possible treatments and undergoes a laboratory test every month for a year.
Consider a mixed-effects model in which the outcome of the laboratory test – described by a real number – is modelled as a linear function of the duration of the treatment.
It is assumed that the regression coefficients can differ between patients, and that the slope can depend upon the received treatment.
Let
• yij = laboratory test outcome when observation i is made on patient j
• xij = duration of treatment (in months) when observation i is made on patient j
• zj = treatment received by patient j
for i = 1, . . . , 12 and j = 1, . . . , 50.
Consider the multilevel model:
yij = β1 + b1j + (β2 + β3zj + b2j)xij + εij
where
β1, β2, β3 are fixed-effects coefficients,
b1j, b2j are random effects for patient j,
and εij is the random error for observation i on patient j.
The distributional assumptions are the following:
• b1j, b2j and εij are mutually independent and independent across j
• εij follows a normal distribution with mean 0 and variance σ²,
• (b1j, b2j)′ follows a bivariate normal distribution with mean (0, 0)′ and covariance matrix
( σ²1  σ12 )
( σ12  σ²2 )
What is the variance of the outcome of the laboratory test after t months of treatment, as a function of the parameters?
For a patient j, the variance of the outcome of the test after t months of treatment is
Var(yij | xij = t) = σ²1 + 2σ12t + σ²2t² + σ².
What is the correlation between the outcomes of the laboratory test of a patient after t1 and after t2 months of treatment?
The covariance between the outcome of the test after t1 and t2 months of treatment is:
Cov(yij, yi′j | xij = t1, xi′j = t2) = σ²1 + σ12(t1 + t2) + σ²2t1t2.
Therefore, the correlation of the outcome of the test after t1 and t2 months of treatment for a given patient is:
[σ²1 + σ12(t1 + t2) + σ²2t1t2] / √[(σ²1 + 2σ12t1 + σ²2t²1 + σ²)(σ²1 + 2σ12t2 + σ²2t²2 + σ²)]
The following table gives the results of the model fitted to the data. The variable Duration is the treatment duration measured in months, and the variable Treatment is 0 (for the first treatment) or 1 (for the second treatment).

Fixed effects           Value   Std.Error
(Intercept)             6.32    0.31
Duration                4.23    0.33
Treatment x Duration    1.74    0.46

Random effects    Std.Dev.   Corr
(Intercept)       1.82       (Intr)
Duration          1.63       0.21
Residual          2.05
What is the estimated average laboratory test outcome after a year of treatment for patients having received the first (resp. second) treatment?
Estimated average test outcome after a year of treatment for patients having received treatment 1:
β̂1 + 12β̂2 = 57.08
Estimated average test outcome after a year of treatment for patients having received treatment 2:
β̂1 + 12(β̂2 + β̂3) = 77.96
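These values can be checked directly from the table above:

beta <- c(6.32, 4.23, 1.74)          # (Intercept), Duration, Treatment x Duration
beta[1] + 12 * beta[2]               # treatment 1: 57.08
beta[1] + 12 * (beta[2] + beta[3])   # treatment 2: 77.96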
What is the estimated distribution of the growth of the outcome of the laboratory test per month for each treatment?
The growth of the outcome per month for treatment z follows a normal distribution with mean β2 + β3z and variance σ²2.
For treatment 1, the estimated mean of the normal distribution is 4.23, whereas for treatment 2 it is 5.97; in both cases the estimated variance is 1.63² = 2.66.
What is the correlation between the outcomes of the laboratory test after 6 and after 12 months of treatment?
The estimated covariance of the outcome of the test after 6 and 12 months of treatment is
Cov(yij, yi′j | xij = 6, xi′j = 12) = σ²1 + 18σ12 + 72σ²2 = 198.39
In addition, the variance of the outcome is
Var(yij | xij) = σ²1 + 2σ12xij + σ²2x²ij + σ²
which is equal to 105.68 for xij = 6 and to 395.15 for xij = 12. Therefore the correlation of the outcome of the test after 6 and 12 months of treatment is
Cov(yij, yi′j | xij = 6, xi′j = 12) / √(Var(yij | xij = 6) Var(yi′j | xi′j = 12)) = 198.39 / √(105.68 × 395.15) = 0.97
5. Summary and Further Properties of Hierarchical Models
Motivations for multi-level modelling
Assume you have a dataset that can be separated into groups.
Hierarchical models or multi-level models can be useful for:
• Learning about treatment effects that vary
• Using all the data to perform inferences for groups with small sample size
• Prediction when data vary with groups
• Analysis of structured data
• More efficient inference for regression parameters
• Including predictors at two different levels
• Getting the right standard error: accurately accounting for uncertainty in prediction and estimation
Fixed effects or random effects?
When should we incorporate random effects?
See Snijders and Bosker (2012), Section 4.3.1.
• If groups are unique entities and inference should focus on these groups, then Fixed. This is often the case with a small number of groups.
• If groups are regarded as a sample from a (perhaps hypothetical) population and inference should focus on this population, then Random. This is often the case with a large number of groups.
• If group-level effects are to be tested, then Random. Reason: a fixed-effects model already explains all differences between groups by the fixed (group) effects, so there is no unexplained between-group variability left that could be explained by group-level variables.
• If group sizes are small and there are many groups (and the assumptions of the model are reasonable), then Random makes better use of the data.
• If the researcher is interested only in within-group effects, and is suspicious about the model for between-group differences, then Fixed is more robust.
• If the assumption that group effects, etc., are normally distributed is a very poor approximation, then Random may lead to unreliable results.
See also the short section “Linear mixed models” of the article: Bates, D. (2005) Fitting linear mixed models in R. R News, 5(1):27–30. http://cran.r-project.org/doc/Rnews/Rnews_2005-1.pdf
Hierarchical modelling as partial pooling
Gelman and Hill (2007) and Jackman (2009) discuss hierarchical models as being a compromise between two alternative models:
• a no-pooling model, and
• a complete-pooling model.
Hierarchical models can be thought of as giving semi-pooling or partial-pooling estimates that lie somewhere between the no-pooling and complete-pooling alternatives.
Radon data
Example from Gelman and Hill (2007). The radon measurements in county j:
yij ∼ N(αj + βxij, σ²)
where
αj ∼ N(µ, σ²b).
The parameters µ, σ², σ²b are unknown.
• We want to estimate the average log radon level αj for county j (and the other parameters).
• We have nj observations available in county j (where n1 = 4, . . . , n85 = 2).
No pooling – one explanatory variable
Classical:
r1.unpooled <- lm(formula = y ~ -1 + x + factor(county))
Bayesian: yi ∼ N(αj[i] + βxi, σ²y) with priors for the αj, β and σy
model {
for (i in 1:n){
y[i] ~ dnorm (y.hat[i], tau.y)
y.hat[i] <- a[county[i]] + b*x[i]
}
b ~ dnorm (0, .0001)
tau.y <- pow(sigma.y, -2)
sigma.y ~ dunif (0, 100)
for (j in 1:J){
a[j] ~ dnorm (0, .0001)
}
}
Complete pooling – one explanatory variable
Classical: r1.pooled <- lm(y ~ x)
Bayesian: yi ∼ N(α + βxi, σ²y) with priors for α, β and σy.
model {
for (i in 1:n){
y[i] ~ dnorm (y.hat[i], tau.y)
y.hat[i] <- a + b*x[i]
}
a ~ dnorm (0, .0001)
b ~ dnorm (0, .0001)
tau.y <- pow(sigma.y, -2)
sigma.y ~ dunif (0, 100)
}
tau.y <- pow(sigma.y, -2) says τy = 1/σ²y, so τy is the precision of the yi.
No pooling and complete pooling
• In complete pooling, we exclude the categorical variable from the model.
• Complete pooling ignores variation between counties.
• In no pooling, we estimate separate models within each group.
• A no-pooling analysis overfits the data within each county.
Partial pooling: a compromise between both extremes.
Partial-pooling estimates
The multilevel estimate for county j is approximately a weighted average of
• the mean of the observations in county j, i.e. the “unpooled estimate” ȳ.j
• and the mean over all counties, i.e. the “completely pooled estimate” ȳall,
given by
α̂j ≈ [(nj/σ²) ȳ.j + (1/σ²b) ȳall] / [nj/σ² + 1/σ²b].
The above formula for α̂j is a weighted average:
• for counties with small nj/σ², the weighting pulls the estimate towards the overall mean ȳall
• while for counties with large nj/σ², the weighting pulls the estimate towards the group mean ȳ.j.
We remark that data from group j has an effect on the estimate α̂k for k ≠ j (via its effect on µ̂ and σ̂²b).
So there is a “sharing” or “borrowing” of information across groups.
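A sketch of this weighted average as an R function; the plug-in values in the call are illustrative, roughly matching the radon fit earlier:

partial.pool <- function(n.j, ybar.j, ybar.all, sigma2, sigma2.b) {
  w <- n.j / sigma2   # weight on the county mean
  v <- 1 / sigma2.b   # weight on the overall mean
  (w * ybar.j + v * ybar.all) / (w + v)
}
partial.pool(n.j = 2, ybar.j = 1.2, ybar.all = 1.31,
             sigma2 = 0.80^2, sigma2.b = 0.32^2)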
Within-group and between-group regressions
The within-group regression coefficient is the regression coefficient within each group, assumed to be the same across the groups.
The between-group regression coefficient is defined as the regression coefficient for the regression of the group means of Y on the group means of X.
This distinction is essential to avoid so-called ecological fallacies.
Ecological fallacies
The ecological fallacy occurs when you make conclusions about individuals based only on analyses of group data.
Example: Score per school
• assume that you measured the score at an exam in various schools and found that a particular school had the highest average score in the district.
• imagine you meet one of the kids from that school and you think to yourself “she must be very good at this exam!”
• Fallacy! Just because she comes from the school with the highest average doesn’t mean that she automatically obtained a high score in the exam.
Example of model with ecological fallacy
Consider the model
yij = β1 + bj + β2xij + εij
with one continuous explanatory variable x.
This model does not take into account that the within-group coefficient may differ from the between-group coefficient.
• Within-group coefficient = expected difference in y between two individuals, in the same group, who differ by 1 unit in xij.
• Between-group coefficient = expected difference in group mean ȳ.j between two groups which differ by 1 unit in x̄.j.
For the model
yij = β1 + bj + β2xij + εij
• within group j, the non-random part of the regression is y = β1 + β2x, and so the within-group coefficient is β2
• taking the average of yij over individuals in group j gives
ȳ.j = β1 + bj + β2x̄.j + ε̄.j
and the non-random part of this regression is y = β1 + β2x, hence the between-group coefficient is also β2.
Remark: the result is the same if we add a random effect in the slope.
If we add the group mean x̄.j as an explanatory variable, we obtain
yij = β1 + bj + β2xij + β3x̄.j + εij
and this allows within- and between-group coefficients to differ:
• within group j, the non-random part of the regression is y = (β1 + β3x̄.j) + β2x, and so the within-group effect is β2
• taking the average of yij over individuals in group j gives
ȳ.j = β1 + bj + (β2 + β3)x̄.j + ε̄.j
and the non-random part of this regression is y = β1 + (β2 + β3)x̄, hence the between-group effect is β2 + β3.
The difference between the two models can be tested by testing if β3 = 0.
Within- and between-group regressions for IQ
Example from Snijders and Bosker (2012), continued. We consider
yij = β1 + bj1 + (β2 + bj2)IQij + β3IQ.j + εij .
lang2.lme <- lme(langscore ~ 1 + IQ + IQsch.av,
random = ~ 1+IQ | schoolnr, data = nlschools)
The variable IQsch.av contains the average IQ score for each child’s school.
> summary(lang2.lme)
....
Random effects:
Formula: ~1 + IQ | schoolnr
StdDev Corr
(Intercept) 2.9964325 (Intr)
IQ 0.4465464 -0.628
Residual 6.2996705
Fixed effects: langscore ~ 1 + IQ + IQsch.av
Value Std.Error DF t-value p-value
(Intercept) 41.12752 0.23468112 3546 175.24852 0e+00
IQ 2.48020 0.06449699 3546 38.45450 0e+00
IQsch.av 1.03055 0.26334813 209 3.91326 1e-04
....
The within-group coefficient is 2.48 and the between-group coefficient is 2.48 + 1.03 = 3.51.
• A pupil with a higher IQ score obtains, on average, a higher langscore: on average, 2.48 extra langscore points per additional IQ point.
• In addition, a pupil with a given IQ obtains, on average, a higher langscore if s/he is in a class (i.e. school) with a higher average IQ: if the class average IQ increases by 1 point, then a pupil’s langscore increases by 1.03 points (on average).
So the effect of the classroom average IQ is about half that of the individual’s IQ. The within- and between-group regressions are the same if β3 = 0.
If the within- and between-group coefficients differ, then instead of using xij and x̄.j it is often convenient to use xij − x̄.j and x̄.j. So then the model is
yij = γ1 + bj + γ2(xij − x̄.j) + γ3x̄.j + εij
    = γ1 + bj + γ2x̃ij + γ3x̄.j + εij    (1)
where x̃ij = xij − x̄.j and where γ1 = β1, γ2 = β2, γ3 = β2 + β3.
The form (1) is convenient as the within-group coefficient γ2 and the between-group coefficient γ3 (and their standard errors) are given immediately once we have fitted (1).
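A sketch of how to construct these variables in R; the data frame dat, with variable x and grouping factor g, is hypothetical:

library(nlme)
dat$x.mean <- ave(dat$x, dat$g)    # group means of x
dat$x.dev  <- dat$x - dat$x.mean   # within-group deviations
lme(y ~ 1 + x.dev + x.mean, random = ~ 1 | g, data = dat)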
References
Snijders and Bosker (2012) note that it is common for within- and between-group coefficients to differ, and that their interpretations are different.
See Snijders and Bosker (2012), Sections 3.1, 4.6.
See also: Gelman, A., Shor, B., Bafumi, J. and Park, D. (2007) Rich state, poor state, red state, blue state: What’s the matter with Connecticut? Quarterly Journal of Political Science 2, 345–367.
Available from: http://www.stat.columbia.edu/~gelman/research/published/
[Rich states vote for the Democrats, but rich people vote Republican.]
Further hierarchical models
So far, we have focused on hierarchical linear models with two levels.
In real-life situations, we very often need to consider
• more levels
• non-nested models
• non-linear models
A three-level model
Suppose we have data with the structure: students nested within classes, and classes nested within schools. That is, we have 3 levels (students, classes, schools).
This is the structure of dataset science in package DAAG. The response variable like is a measure of a student’s attitude to science on a scale from 1 (dislike) to 12 (like). Explanatory variables available include: sex, a factor with levels f, m; and PrivPub, a factor with levels private school, public school.
## may need to install DAAG first
library(DAAG)
science.lme <- lme(like ~ sex + PrivPub,
random = ~ 1 | school/class,
data = science, na.action = na.omit)
We read “random = ~ 1 | school/class” as “a random intercept for each school, plus a random intercept for each class within each school”.
The package DAAG is associated with the book: Maindonald, J. and Braun, W. J. (2010) Data Analysis and Graphics Using R. Third Edition. Cambridge University Press. See Section 10.2 and also
http://maths-people.anu.edu.au/~johnm/r-book/xtras/mlm-lme.pdf
Non-nested grouping factors
In the above model, each student belongs to exactly one class, and each class belongs to exactly one school – there is a nesting structure.
When grouping factors are not nested, they are said to be“crossed”.
Package lme4 is more suited than nlme to handling non-nested data.
http://cran.r-project.org/web/packages/lme4/index.html
Example of non-nested model
From Gelman et al.
Psychological experiment with two potentially interacting factors:
• psychological experiment on 40 pilots on flight simulators
• under J = 5 different treatment conditions
• coming from K = 8 different airports
For each pilot i, we denote by
• j[i] its treatment condition
• k[i] its associated airport
The response can be fitted to a non-nested model
yi ∼ N(µ + γj[i] + δk[i], σ²) for i = 1, . . . , 40
γj ∼ N(0, σ²γ) for j = 1, . . . , 5
δk ∼ N(0, σ²δ) for k = 1, . . . , 8
The parameters γj and δk represent respectively the treatment and airport effects.
When we fit to the data, the estimated residual standard deviations at the individual, treatment and airport levels are σ̂ = 0.23, σ̂γ = 0.04, and σ̂δ = 0.32. Thus the variation among the airports is very big, but the treatments have almost no impact on the variation.
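A sketch of this crossed model in lme4 syntax; the data frame pilots, with columns y, treatment and airport, is hypothetical:

library(lme4)
## crossed (non-nested) random effects for treatment and airport
fit <- lmer(y ~ 1 + (1 | treatment) + (1 | airport), data = pilots)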
More complex multilevel models
The models we have considered so far can be generalised in a number of ways, in particular to logistic regression or generalised linear models. Other extensions include:
• the variance can vary, as a parametric function of input variables, and in a multilevel way by allowing different variances for groups
• models with several factors can have many potential interactions, which can themselves be modelled in a structured way
• regression models can be set up for multivariate outcomes
• more complicated models are appropriate for spatial data or network structure.