48
Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/ We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact [email protected] with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use. Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers. 1 / 48

terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

  • Upload
    ngodung

  • View
    217

  • Download
    2

Embed Size (px)

Citation preview

Page 1: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Author(s): Kerby Shedden, Ph.D., 2010

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material.

Copyright holders of content included in this material should contact [email protected] with any questions, corrections, or clarification regarding the use of content.

For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use.

Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition.

Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.

1 / 48

Page 2: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Regression analysis with dependent data

Kerby Shedden

Department of Statistics, University of Michigan

December 8, 2015

2 / 48

Page 3: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Clustered data

Clustered data are sampled from a population that can be viewedas the union of a number of related subpopulations.

Write the data as

Yij ∈ R,Xij ∈ Rp,

where i indexes the subpopulation and j indexes the individuals inthe sample belonging to the i th subpopulation.

3 / 48

Page 4: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Clustered dataIt may happen that units from the same subpopulation are morealike than units from different subpopulations, i.e.

cor(Yij ,Yij ′) > cor(Yij ,Yi ′j ′).

Part of the within-cluster similarity may be explained by X , i.e.units from the same subpopulation may have similar X values,which leads to similar Y values. In this case,

cor(Yij ,Yij ′) > cor(Yij ,Yij ′ |Xij ,Xij ′).

Even after accounting for measured covariates, units in a clustermay still resemble each other more than units in different clusters:

cor(Yij ,Yij ′ |X ) > cor(Yij ,Yi ′j ′ |Xij ,Xi ′j ′).

4 / 48

Page 5: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Clustered data

A simple working model to account for this dependence is

Yij = θi + β′Xij .

The idea here is that β′Xij explains the variation in Y that isrelated to the measured covariates, and the θi explain variation inY that is related to the clustering.

This working model would be correct if there were q omittedvariables Zij`, ` = 1, . . . , q, that were constant for all units in thesame cluster (i.e. Zij` depends on ` and i , but not on j).

In that case, θi would stand in for the value of∑

` ψ`Zij` that wewould have obtained if the Zij` were observed.

5 / 48

Page 6: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Clustered data

As an alternate notation, we can vectorize the data to expressY ∈ Rn and X ∈ Rn×p+1, then write

Y =∑

j

θj Ij + X β,

where Ij ∈ Rn is the indicator of which subjects belong to cluster j .

If the observed covariates in X are related to the clustering (i.e. ifthe columns of X and the Ij are not orthogonal), then OLS

apportions the overlapping variance between β and θ.

6 / 48

Page 7: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Test score example

Suppose Yij is a reading test score for the j th student in the i th

classroom, out of a large number of classrooms that areconsidered. Suppose Xij is the income of a student’s family.

We might postulate as a population model

Yij = θi + β′Xij + εij ,

which can be fit as

Yij = θi + β′Xij .

7 / 48

Page 8: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Test score example

Ideally we would want the parameter estimates to reflect sources ofvariation as follows:

I “Direct effects” of parent income such as access to books, lifeexperiences, good health care, etc. should go entirely to β.

I Attributes of classrooms that are not related to parentincome, for example, the effect of an exceptionally good orbad teacher, should go entirely to the θi .

I Attributes of classrooms that are correlated with parentincome, such as teacher salary, training, and resources, will beapportioned by OLS between θi and β.

I Unique events affecting particular individuals, such as thesevere illness of the student or a family member, should goentirely to ε.

8 / 48

Page 9: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Other examples of clustered data

I Treatment outcomes for patients treated in various hospitals.

I Crime rates in police precincts distributed over a number oflarge cities (the precincts are the units and the cities are theclusters).

I Prices of stocks belonging to various business sectors.

I Surveys in which the data are collected following a clustersampling approach.

9 / 48

Page 10: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

What if we ignore the θi?

The θi are usually not of primary interest, but we should beconcerned that by failing to take account of the clustering, we mayincorrectly assess the relationship between Y and X .

If the θi are nonzero, but we fail to include them in the model, theworking model is misspecified.

Let X be the design matrix without intercept, and let Q be thematrix of cluster indicators (which includes the intercept):

Y =

Y11

Y12

Y21

Y22

Y23

· · ·

Q =

1 0 · · ·1 0 · · ·0 1 · · ·0 1 · · ·0 1 · · ·· · · · · · · · ·

X =

X111 X112 · · ·X121 X122 · · ·X211 X212 · · ·X221 X222 · · ·X231 X232 · · ·· · · · · · · · ·

10 / 48

Page 11: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

What if we ignore the θi?Let X = [1n|X ]. The estimate that results from regressing Y on Xis

E β∗ = (X ′X )−1X ′EY

= (X ′X )−1X ′(Xβ + Qθ).

where β∗ = (β0, β′)′.

For the bias of β for β to be zero, we need

E β∗ − β = (X ′X )−1X ′(Xβ + Qθ − X β) = 0,

where β0 = E β0 and β = (β0, β′)′. Thus we need

0 = X ′(Xβ + Qθ − X β) = X ′(Qθ − β0).

11 / 48

Page 12: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

What if we ignore the θi?

Let S = Qθ be the vector of cluster effects. Since the first columnof X consists of 1’s, we have that

S = β0.

For the other covariates, we have that

X ′j S = β0X′j 1n,

which implies that Xj and S have zero sample covariance.

This is unlikely in many studies, where people tend to cluster (inschools, hospitals, etc.) with other people having similar covariatelevels.

12 / 48

Page 13: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Fixed effects analysis

In a fixed effects approach, we model the θi as regressionparameters, by including additional columns in the design matrixwhose covariate levels are the cluster indicators.

As the sample size grows, in most applications the cluster size willstay constant (e.g. a primary school classroom might have up to30 students). Thus the number of clusters must grow, so thedimension of the parameter vector θ grows. This is not typical instandard regression modeling.

13 / 48

Page 14: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Cluster effect examples

What does the inclusion of fixed effects do to the parameterestimates β, which are often of primary interest?

The following slides show scatterplots of an outcome Y against ascalar covariate X , in a setting where there are three clusters(indicated by color).

Above each plot are the coefficient estimates, Z-scores, and r2

values for fits in which the cluster effects are either included (topline) or excluded (second line). These estimates are obtained fromthe following two working models:

Y = α + βX +∑

θj Ij Y = α + βX .

The Z-scores are β/SD(β) and the r2 values are cor(Y ,Y )2.

14 / 48

Page 15: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Cluster effect examples

Left: No effect of clusters or X ; X and clusters are independent;β ≈ 0 with and without cluster effects in model.

Right: An effect of X , but no cluster effects; X and clusters areindependent; β, Z , and r2 are similar with and without clustereffects in model.

X

Y

β1 =−0.07 Z=−0.49 r2 =0.02

β1s=−0.06 Zs =−0.44 r 2s =0.00

X

Y

β1 =1.04 Z=7.75 r2 =0.57

β1s=1.01 Zs =7.81 r 2s =0.56

15 / 48

Page 16: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Cluster effect examples

Left: Cluster effects, but no effect of X ; X and clusters areindependent; β ≈ 0 with and without cluster effects in model.

Right: An effect of clusters but no effect of X ; X and clusters aredependent; when cluster effects are not modeled their effect ispicked up by X .

X

Y

β1 =−0.03 Z=−0.85 r2 =0.94

β1s=0.00 Zs =0.02 r 2s =0.00

X

Y

β1 =0.03 Z=0.19 r2 =0.94

β1s=0.92 Zs =20.72 r 2s =0.90

16 / 48

Page 17: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Cluster effect examples

Left: Effects for clusters and for X ; X and clusters are dependent.

Right: Effects for clusters but not for X ; X and clusters aredependent but the net cluster effect is not linearly related to X .

X

Y

β1 =0.90 Z=5.97 r2 =0.95

β1s=0.96 Zs =29.26 r 2s =0.95

X

Y

β1 =−0.04 Z=−0.31 r2 =0.84

β1s=0.00 Zs =0.02 r 2s =0.00

17 / 48

Page 18: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Cluster effect examples

Left: Effects for clusters and for X ; X and clusters are dependent;the X effect has the opposite sign as the net cluster effect.

Right: Effects for clusters and for X ; X and clusters are dependent;the signs and magnitudes of the X effects are cluster-specific.

X

Y

β1 =−1.93 Z=−12.68 r2 =0.99

β1s=2.72 Zs =16.81 r 2s =0.85

X

Y

β1 =−0.52 Z=−1.70 r2 =0.98

β1s=2.92 Zs =8.20 r 2s =0.58

18 / 48

Page 19: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Implications of fixed effects analysis for observational data

A stable confounder is a confounding factor that is approximatelyconstant within clusters. A stable confounder will become part ofthe net cluster effect.

If a stable confounder is correlated with an observed covariate X ,then this will create non-orthogonality between the cluster effectsand the effects of the observed covariates.

19 / 48

Page 20: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Implications of fixed effects analysis for experimental data

Experiments are often carried out on “batches” of objects(specimens, parts, etc.) in which uncontrolled factors causeelements of the same batch to be more similar than elements ofdifferent batches.

If treatments are assigned randomly within each batch, there areno stable confounders (in general there are no confounders inexperiments). Therefore the overall OLS estimate of β is unbiasedas long as the standard linear model assumptions hold.

20 / 48

Page 21: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Implications of fixed effects analysis for experimental data

Suppose the design is balanced (e.g. exactly half of each batch istreated and half is untreated). This is an orthogonal design, so theestimate based on the working model

Yij = βXij

and the estimate based on the working model

Yij = θi + β∗Xij

are identical (β = β∗). Thus they have the same variance. But theestimated variance of β will be greater than the estimated varianceof β∗ (since the corresponding estimate of σ2 is greater), so it willhave lower power and wider confidence intervals.

21 / 48

Page 22: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Random cluster effects

As we have seen, the cluster effects θi can be treated asunobserved constants, and estimated as we would estimate anyother regression coefficient.

An alternative way to handle cluster effects is to view the θi asunobserved (latent) random variables.

Now we have two random variables in the model: θi and εij . Theseare usually taken to be independent of each other.

If the θi are independent and identically distributed, we cancombine them with the error terms to get a single random errorterm per observation:

ε∗ij = θi + εij .

22 / 48

Page 23: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Random cluster effects

Let Yi• = (Yi1, . . . ,Yini)′ denote the vector of responses in the i th

cluster, let Xi• denote the matrix of predictor variables for the i th

cluster, and let εi• denote the vector of errors (ε∗ij) for the i th

cluster.

Thus we have the model

Yi• = Xi•β + εi•,

for clusters i = 1, . . . ,m. The εi• are taken to be uncorrelatedbetween clusters, i.e.

cov(εi•, εi ′•) = 0

for i 6= i ′.

23 / 48

Page 24: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Random cluster effects

The structure of the covariance matrix

Si ≡ cov(εi•|Xi•)

is

Si =

σ2 + σ2

θ σ2θ · · · · · · σ2

θ

σ2θ σ2 + σ2

θ σ2θ · · · · · ·

· · · · · · · · · · · · · · ·· · · · · · · · · · · · · · ·σ2θ σ2

θ · · · · · · σ2 + σ2θ

= σ2I +σ2θ11T ,

where var(εij) = σ2 and var(θi ) = σ2θ .

24 / 48

Page 25: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Generalized Least Squares

Suppose we have a linear model with mean structureE [Y |X ] = Xβ for Y ∈ Rn, X ∈ Rn×p+1, and β ∈ Rp+1, andvariance structure var(Y |X ) ∝ Σ, where Σ is a given n× n matrix.

We can factor the covariance matrix as Σ = GG ′, and consider thetransformed model

G−1Y = G−1Xβ + G−1ε.

Then letting η ≡ G−1ε, it follows that cov(η) = I , and note thatthe slope vector β of the transformed model is identical to theslope vector of the original model.

25 / 48

Page 26: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Generalized Least Squares

We can use GLS to analyze clustered data with random clustereffects.

Let Y ∗i• = S−1/2i Yi•, ε

∗i• = S

−1/2i εi•, and X ∗i• = S

−1/2i Xi•.

Let Y ∗, X ∗, and ε∗ be Y ∗i•, X ∗i•, and ε∗i•, respectively, stacked overi .

26 / 48

Page 27: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Generalized Least Squares

Since cov(ε∗i•) ∝ I , the OLS estimate of β for the model

Y ∗ = X ∗β + ε∗

is the best estimate of β that is linear in Y ∗ (by the Gauss-Markovtheorem).

Since the set of linear estimates based on Y ∗ is the same as the setof linear estimates based on Y , it follows that the GLS estimate ofβ based on Y is the best unbiased estimate of β that is linear in Y .

27 / 48

Page 28: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Generalized Least Squares

The covariance matrix of β in GLS is

(X ∗TX ∗)−1 = (XT Σ−1X )−1

Note that there is no σ2 term on the left because we havetransformed the error distribution to have covariance matrix I .

To apply GLS, it is only necessary to know Σ up to a multiplicativeconstant. The same estimated slopes β are obtained if wedecorrelate with Σ or with kΣ for k > 0. However we need toestimate σ2 and use

σ2(XT Σ−1X )−1

for inference when Σ is only correct up to a scalar multiple.

28 / 48

Page 29: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Generalized Least Squares

Since the sampling covariance matrix for GLS is proportional to(XT Σ−1X )−1, the parameter estimates are uncorrelated with eachother if and only if XT Σ−1X is diagonal – that is, if the columnsof X are mutually orthogonal with respect to the Mahalanobismetric defined by Σ.

This is related to the fact that Y in GLS is the projection of Yonto col(X ) in the Mahalanobis metric defined by Σ. Thisgeneralizes the fact that Y in OLS is the projection of Y ontocol(X ) in the Euclidean metric.

29 / 48

Page 30: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Least squares with mis-specified covariance

What if we perform GLS using a “working covariance” S that isincorrect?

For example, what happens if we use OLS when the actualcovariance matrix of the errors is Σ 6= I?

Since

E [ε∗|X ∗] = E [ε∗|X ] = 0,

the estimate β remains unbiased. However it has two problems: itmay not be the BLUE (i.e. it may not have the least varianceamong unbiased estimates), and the usual linear model inferenceprocedures will be wrong.

30 / 48

Page 31: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Least squares with mis-specified covariance

The sampling covariance when the error structure is mis-specifiedis given by the “sandwich expression:”

cov(β) = (X ∗′X ∗)−1X ∗′Σε∗X∗(X ∗′X ∗)−1

where

Σε∗ = cov(ε∗|X ∗).

This result covers two situations: (i) we use OLS, so X ∗ = X andΣε∗ = Σε is the covariance matrix of the errors, and (ii) we use a“working covariance” to define the decorrelating matrix G , andthis working covariance is not equal to Σ∗.

31 / 48

Page 32: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Iterated GLS

In practice, we will generally not know Σ, so it must be estimatedfrom the data. Σ is the covariance matrix of the errors, so we canestimate Σ from the residuals.

Typically a low-dimensional parameteric model Σ = Σ(α) is usedfor the error covariance.

For a given error model the fitting algorithm alternates betweenestimating α and estimating the regression slopes β.

32 / 48

Page 33: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Iterated GLS

For example, suppose our working covariance has the form

Si (j , j) = ν Si (j , k) = rν (j 6= k).

This is an exchangeable model with “intraclass correlationcoefficient” (ICC) r between two observations in the same cluster.

There are several closely-related ICC estimates. One approach isbased on the standardized residuals Rs

ij :∑i

∑j 6=j ′

RsijR

s ij ′/(∑

i

ni (ni − 1)/2− p − 1)

where ni is the size of the i th cluster.

33 / 48

Page 34: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Iterated GLS

Another common model for the error covariance is the first orderautoregressive (AR-1) model:

Σij = α|i−j |,

where |α| < 1 is a parameter.

There are several possible estimates of α borrowing ideas fromtime series analysis.

34 / 48

Page 35: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Generalized Least Squares and stable confounders

For GLS to give unbiased estimates of β (that also have minimumvariance due to the Gauss-Markov theorem), we must have

E [ε∗ij |X ] = E [θi + εij |X ] = 0.

Since E [εij |X ] = 0, this is equivalent to requiring that E [θi |X ] = 0.

Thus if the X covariates have distinct within-cluster mean values,and the within-cluster mean values of X are correlated with the θi ,then the GLS estimate of β will be biased.

35 / 48

Page 36: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Likelihood inference for the random intercepts model

The random intercepts model

Yij = θi + β′Xij + εij ,

can be analyzed using a likelihood approach, using:

I A random intercepts density

φ(θ|X )

I A density for the data given the random intercept:

f (Y |X , θ)

36 / 48

Page 37: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Likelihood inference for the random intercepts model

The marginal model for the observed data is obtained byintegrating out the random intercepts

f (Y |X ) =

∫f (Y |X , θ)φ(θ|X )dθ.

The parameters in f (Y |X ) are all parameters in f (Y |X , θ), and allparameters in φ(θ|X ).

Typically, f (Y |X , θ) is parameterized in terms of β and σ2, andφ(θ|X ) is parameterized in terms of a scale parameter σθ.

37 / 48

Page 38: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Likelihood inference for the random intercepts model

The first two moments of the marginal distribution Yij |X can becalculated explicitly. Let µθ ≡ E (θ|X ) and σ2

θ = var(θ|X ). Themoments are

E (Yij |X ) = µθ + β′Xij

var(Yij |X ) = σ2θ + σ2,

cov(Yij1 ,Yij2 |X ) = σ2θ (j1 6= j2).

38 / 48

Page 39: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Likelihood inference for the random intercepts model

Suppose

ε|X ∼ N(0, σ2) θ|X ∼ N(µθ, σ2θ).

In this case Y |X is Gaussian, with mean and variance as givenabove. Thus the random intercept model can be equivalentlywritten in marginal form as

Yij = µθ + β′Xij + ε∗ij

where ε∗ij = θi − µθ + εij .

It follows that E (ε∗ij |X ) = 0, var(ε∗ij |X ) = σ2θ + σ2, and the ε∗ij

values have correlation coefficient σ2θ/(σ2 + σ2

θ) within clusters.

39 / 48

Page 40: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Likelihood computation

Maximum likelihood estimates for the model can be calculatedusing a gradient-based optimization procedure, or the EMalgorithm.

Asymptotic standard errors can be obtained from the inverse ofFisher information, and likelihood ratio tests can be used tocompare nested models.

The parameter estimates from iterated GLS and the MLE for therandom intercepts model will be similar, and both are consistentunder similar conditions.

40 / 48

Page 41: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Predicting the random intercepts

Since the model is fit by optimizing the marginal distribution, weobtain an estimate of σ2

θ = var(θi ), but we don’t automaticallylearn anything about the individual θi .

If there is an interest in the individual θi values, we can predictthem using the best linear unbiased predictor (BLUP):

Eβ,σ2,σ2θ,µθ

(θ|Y ,X ) = µθ + cov(θ,Y )cov(Y |X )−1(Y − E [Y |X ])

41 / 48

Page 42: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Predicting the random intercepts

The estimated second moments needed to clculate the BLUP are:

cov(θi ,Yi ) = σ2θ · 1ni

and

cov(Yi |X ) = σ2I + σ2θ11T .

where ni is the size of the i th group, cov(θi ,Yj) = 0 for i 6= j , andcov(Y |X ) is block-diagonal with blocks given by cov(Yi |X ).

42 / 48

Page 43: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Predicting the random intercepts

Note that once the parameters are estimated, the BLUP for θi onlydepends on the data in cluster i :

Eβ,σ2,σ2θ,µθ

(θ|Y ,X ) = µθ + cov(θi ,Yi )cov(Yi |X )−1(Yi − E [Yi |X ])

cov(Yi |X )−1 = σ−2(I − σ2θ11T/(σ2 + ni σ

2θ))

For a given set of parameter values, the BLUP is a linear functionof the data. For the i th cluster,

Eβ,σ2,σ2θ,µθ

(θi |Y ,X ) = µθ +niσ2θ/(σ2 +ni σ

2θ) ·1T (Yi − E [Yi |X ])/ni ,

43 / 48

Page 44: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Predicting the random intercepts

In the BLUP for θi , this term is the mean of the residuals:

1T (Yi − E [Yi |X ])/ni

and it is shrunk by this factor:

niσ2θ/(σ2 + ni σ

2θ)

This shrinkage allows us to interpret the random intercepts modelas a “partially pooled” model that is intermediate between thefixed effects model and the model that completely ignores theclusters.

44 / 48

Page 45: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Predicting the random intercepts

In the fixed effects model, the parameter estimates θi are unbiased,but they are “overdispersed”, meaning that the sample variance ofthe θi will generally be greater than σ2

θ .

In the random intercepts model, the variance parameter σ2θ is

estimated with low bias, and the BLUP’s of the θi are shrunktoward zero.

45 / 48

Page 46: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Random slopes

Suppose we are interested in a measured covariate Z whose effectmay vary by cluster. We might start with the model

Yi = Xiβ + γZi + ε,

so if Zi ∈ {0, 1}ni is a vector of treatment indicators (1 for treatedsubjects, 0 for untreated subjects), then γ is the average changeassociated with being treated (the population treatment effect).

In many cases, it is reasonable to consider the possibility thatdifferent clusters may have different treatment effects – that is,different clusters have different γ values. In this case we can let γi

be the response for cluster i .

46 / 48

Page 47: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Random slopes

Suppose we model the random slopes γi as being Gaussian (givenX ) with expected value γ0 and variance σ2

γ . The marginal model is

E [Yi |Xi ,Zi ] = X ′i β + Ziγ0

and

cov[Yi |Xi ,Zi ] = σ2γZiZ

′i + σ2I .

Again we can form a BLUP Eβ,σ2γ σ

2 [γi |Y ,X ,Z ], and the BLUP is a

shrunken version of the fixed effects model in which we regress Yon X and Z within every cluster.

47 / 48

Page 48: terms of the Creative Commons Attribution Share Alike 3.0 ...dept.stat.lsa.umich.edu/~kshedden/Courses/Stat600/Notes/dependent... · terms of the Creative Commons Attribution Share

Linear mixed effects models

The random intercept and random slope model are special cases ofthe linear mixed model:

Y = Xβ + Zγ + ε

X is a n × p design matrix and Z is a n × q design matrix; γ is arandom vector with mean 0 and covariance matrix Ψ, ε is arandom vector with mean 0 and covariance matrix σ2I .

48 / 48