Heterogeneity and Hierarchical Data
Example from Squid
• From Zuur et al. Mixed Effects Models and Extensions in Ecology with R
• Testis weight was measured in squids of different sizes over multiple months
• Body size and month are potential explanatory variables
Squid Scatterplot
[Figure: scatterplot of testis weight against dorsal mantle length (DML)]
Squid Residuals
[Figure: residuals of the linear model plotted against fitted values, month, and DML]
Problem and Solution
• Variance in the residuals increases with DML
• Variance in the residuals also varies with month
• Solution: Use a model that can accommodate a changing variance as a function of explanatory variables
Generalized Least Squares
• Generalized least squares models permit heterogeneity by explicitly modeling the change in variance as a function of an explanatory variable
• It’s essentially a linear regression, but with an added term (or function) for the variance
Modeling the Variance
• Fixed Variance Structure
– Assumes that $\mathrm{var}(\varepsilon_i) = \sigma^2 \times \mathrm{DML}_i$
• Implementation
> library(nlme)  # provides gls() and the variance functions
> M.lm <- gls(Testisweight ~ DML*fmonth, data=squid)  # linear regression model
> vf1Fixed <- varFixed(~DML)
> M.gls1 <- gls(Testisweight ~ DML*fmonth, weights=vf1Fixed, data=squid)  # linear model with variance proportional to DML
> anova(M.lm, M.gls1)  # compare the models to see which has the lower AIC
Which Model is Better?
• The variance is allowed to vary with the value of DML
• In this case, the variance is proportional to DML, so the varFixed model is only appropriate when the variance increases linearly with DML
• M.gls1 has the lower AIC, so it is preferred over M.lm

Model    df   AIC        BIC        logLik
M.lm     25   3752.084   3867.385   -1851.042
M.gls1   25   3620.898   3736.199   -1785.449
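The comparison above can be tried without the squid data. The following is a minimal sketch on simulated data, assuming a single covariate `x` standing in for DML and an error variance that is exactly proportional to `x`; the names `sim`, `fit_const`, and `fit_fixed` are illustrative only and do not come from the original analysis.

```r
library(nlme)  # ships with standard R installations

set.seed(1)
n <- 300
x <- runif(n, 100, 500)                            # stand-in for DML (simulated)
y <- 2 + 0.03 * x + rnorm(n, sd = 0.2 * sqrt(x))   # var(error) = 0.04 * x
sim <- data.frame(x = x, y = y)                    # simulated, not the squid data

fit_const <- gls(y ~ x, data = sim)                          # constant variance
fit_fixed <- gls(y ~ x, data = sim, weights = varFixed(~x))  # variance proportional to x

aic_const <- AIC(fit_const)
aic_fixed <- AIC(fit_fixed)

# Both models have the same number of parameters, so anova() lists AIC, BIC,
# and logLik but reports no likelihood ratio test.
anova(fit_const, fit_fixed)
```

Because varFixed adds no parameters, the two fits differ only in the assumed variance, and the model whose variance assumption matches the data should show the lower AIC.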
Modeling the Change in Variance Over Months
• Month is modeled as a nominal (categorical) variable
• varIdent allows each month to have its own variance
> vf2 <- varIdent(form= ~1 | fmonth)
> M.gls2 <- gls(Testisweight ~ DML*fmonth, data=squid, weights=vf2)
> anova(M.lm,M.gls1,M.gls2)
> summary(M.gls2)
Results of anova()
         Model  df   AIC      BIC      logLik    Test     L.Ratio   p-value
M.lm     1      25   3752.1   3867.4   -1851.0
M.gls1   2      25   3620.9   3736.2   -1785.4
M.gls2   3      36   3614.4   3780.5   -1771.2   2 vs 3   28.46     0.0027
• df: Degrees of freedom, which indicates the complexity of the models. Note that having a different variance for each level of month increases the complexity quite a bit.
• AIC: Akaike Information Criterion; the lowest value indicates the best model. A difference of about 6 means the model with the higher AIC has a relative likelihood of only about 0.05 (exp(−6/2)) compared with the better model.
• L.Ratio and p-value: A formal test for a significant difference between models. This test is only valid if the models are nested (that is, you can get the simpler model by setting a parameter in the more complex model to zero). AIC is valid whether or not the models are nested (but it isn’t a formal test).
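The quantities in the anova() table can be checked by hand. This sketch only re-uses the rounded numbers printed in the table; the variable names are illustrative, and the likelihood ratio computed from the rounded logLik values (28.4) differs slightly from the reported 28.46.

```r
# Values copied from the anova() table (M.gls1 vs M.gls2)
aic_gls1 <- 3620.9;   aic_gls2 <- 3614.4
loglik_gls1 <- -1785.4; loglik_gls2 <- -1771.2
df_gls1 <- 25;        df_gls2 <- 36

# AIC difference and relative likelihood of the higher-AIC model
delta_aic <- aic_gls1 - aic_gls2
rel_lik   <- exp(-delta_aic / 2)

# Likelihood ratio statistic from the rounded log-likelihoods
lr <- 2 * (loglik_gls2 - loglik_gls1)

# p-value for the reported statistic; the degrees of freedom are the
# difference in the number of parameters (36 - 25 = 11)
p_value <- pchisq(28.46, df = df_gls2 - df_gls1, lower.tail = FALSE)
```

The 11 extra parameters in M.gls2 are the additional month-specific variance multipliers, which is why the test has 11 degrees of freedom.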
Results of summary()
Generalized least squares fit by REML
  Model: Testisweight ~ DML * fmonth
  Data: squid
       AIC      BIC    logLik
  3614.436 3780.469 -1771.218

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | fmonth
 Parameter estimates:
    2     9    12    11     8    10     5     7
1.000 2.991 1.273 1.509 0.982 2.216 1.639 1.378
    6     4     1     3
1.647 1.423 1.958 1.979
… plus a lot more. The output shows the model, the AIC, and the standard deviation estimated for each month, expressed as a multiplier relative to the reference month (month 2, fixed at 1.000).
Variance Structures
• varFixed: Variance changes proportionally to a continuous variable
• varIdent: Variance is different for each level of a nominal variable
• varPower: Variance is proportional to a continuous variable raised to a power
– $\varepsilon_i \sim N(0, \sigma^2 \times |\mathrm{DML}_i|^{2\delta})$
• varExp: Exponential variance structure
– $\mathrm{var}(\varepsilon_i) = \sigma^2 \times e^{2\delta \times \mathrm{DML}_i}$
– Better when the covariate can take the value of zero
• varConstPower: Constant plus power of the variance covariate
– $\mathrm{var}(\varepsilon_i) = \sigma^2 \times (\delta_1 + |\mathrm{DML}_i|^{\delta_2})^2$
• varComb: Combination of variance structures
– $\mathrm{var}(\varepsilon_{ij}) = \sigma_j^2 \times e^{2\delta \times \mathrm{DML}_{ij}}$
– Allows different levels for month and a change in variance with DML
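To see how the power structure behaves, the sketch below simulates data whose standard deviation is proportional to the covariate (so the true δ is 1) and lets gls() estimate δ. The data, the covariate `x`, and the object names are assumptions made for illustration; nothing here uses the squid data.

```r
library(nlme)

set.seed(2)
n <- 400
x <- runif(n, 1, 10)                        # hypothetical variance covariate
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)   # sd proportional to x, so true delta = 1
sim <- data.frame(x = x, y = y)

# var(e_i) = sigma^2 * |x_i|^(2 * delta)
fit_power <- gls(y ~ x, data = sim, weights = varPower(form = ~x))
# var(e_i) = sigma^2 * exp(2 * delta * x_i)
fit_exp   <- gls(y ~ x, data = sim, weights = varExp(form = ~x))

# Estimated delta of the power structure on its natural scale
delta_hat <- coef(fit_power$modelStruct$varStruct, unconstrained = FALSE)

# The two structures are not nested, so compare them by AIC
aic_table <- AIC(fit_power, fit_exp)
```

With this simulated design the estimated power should land close to 1, which is a quick way to confirm that the variance function is doing what the formula says.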
Applying the varComb Variance Structure
# Allow variance to vary with month and DML
> vf8 <- varComb(varIdent(form = ~1 | fmonth), varExp(form = ~DML))
> M.gls8 <- gls(Testisweight ~ DML * fmonth, weights=vf8, data=squid)
> anova(M.lm, M.gls8)
> vf9 <- varComb(varIdent(form = ~1 | fmonth), varPower(form = ~DML))
> M.gls9 <- gls(Testisweight ~ DML * fmonth, weights=vf9, data=squid)
> anova(M.gls8, M.gls9)
> plot(M.gls9)

         Model  df   AIC        BIC        logLik
M.gls8   1      37   3414.817   3585.463   -1670.409
M.gls9   2      37   3406.231   3576.877   -1666.116
NOTE: These models are NOT nested, so compare them by AIC rather than with a likelihood ratio test
Residuals of M.gls9
[Figure: standardized residuals against fitted values for the model with varComb, shown beside the residual plots (fitted values, month, DML) of the original linear model]
In real life, you might want to examine this relationship for each value of the nominal variable.
How do I know which variance structure to use?
• If the variance covariate (i.e., the variable across which the variance changes) is nominal, use varIdent because it’s the only choice.
• In general, don’t use varFixed, because it’s inflexible (a strict linear relationship).
• Start with varPower, varExp, or varConstPower.
• Use varComb to combine different models for different variables.
• Use the distribution of residuals and AIC to pick the best model.
• If you have a biological reason to believe the variance should vary in a particular way, use that knowledge.
• If the values of the variance covariate are extremely large (100 or more), consider using a different scale (e.g., meters instead of mm) or standardizing them to avoid model instability.
Mixed Effects for Nested Data
• Example (from Zuur et al., Mixed Effects Models and Extensions in Ecology with R)
– Species richness on a beach as a function of exposure and the height of the sampling station compared to mean tidal level (NAP)
– Exposure is nominal and has two classes, NAP is continuous
– Each of nine beaches was sampled at five sites, so samples are nested within beaches
– The question is whether species richness is affected by NAP and exposure (two factor anova, except for the hierarchical nature of the data)
Nested Data
• Obviously the sites within beaches are not independent
• One old-fashioned way to analyze the data would be to estimate the slope of species richness on NAP for each beach and then use the slopes in the analysis
• This approach discards much of the information by using a summary statistic for each beach and reduces the sample size to 9 (number of beaches)
The Data
    Sample  Richness  Exposure  NAP     Beach
1   1       11        10         0.045  1
2   2       10        10        -1.036  1
3   3       13        10        -1.336  1
4   4       11        10         0.616  1
5   5       10        10        -0.684  1
6   6        8         8         1.190  2
7   7        9         8         0.820  2
8   8        8         8         0.635  2
9   9       19         8         0.061  2
10  10      17         8        -1.334  2
11  11       6        11        -0.976  3
12  12       1        11         1.494  3
13  13       4        11        -0.201  3
14  14       3        11        -0.482  3
15  15       3        11         0.167  3
Beach and exposure look confounded in this extract, but they are not: in the full dataset, several beaches share the same exposure value. (Exposure values 8 and 10 are both treated as “low exposure”, and 11 as “high exposure”.)
Getting the Linear Regression Coefficients for Each Beach
• > library(nlme)
• > by_beach <- lmList(Richness ~ NAP | Beach, data=rikz)
• This function estimates the linear regression for each value of Beach
• However, given the limitations of this approach, we probably want to use a model that makes better use of the data
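The two-stage approach can be sketched on simulated data to show how little is left for the second stage. The data below are invented for illustration (continuous "richness" values for 9 beaches with 5 sites each, mimicking the design described above) and are not the RIKZ data.

```r
library(nlme)

set.seed(3)
beach <- factor(rep(1:9, each = 5))            # 9 beaches, 5 sites each
nap   <- runif(45, -1.3, 2)
b0    <- rnorm(9, mean = 6, sd = 2)[beach]     # beach-specific intercepts (simulated)
rich  <- b0 - 2.5 * nap + rnorm(45, sd = 1.5)  # true slope on NAP is -2.5
sim   <- data.frame(Richness = rich, NAP = nap, Beach = beach)

# Stage 1: a separate linear regression for each beach
by_beach <- lmList(Richness ~ NAP | Beach, data = sim)
slopes   <- coef(by_beach)[, "NAP"]            # one slope per beach

# Stage 2: analyze the slopes; only 9 numbers remain from 45 observations
slope_test <- t.test(slopes)
```

The second stage treats every slope as equally reliable and works with a sample size of 9, which is the loss of information the slides refer to.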
Linear Mixed Effects Model
• Contains a mixture of fixed effects and random effects
– $R_i = X_i \beta + Z_i b_i + \varepsilon_i$
– There is one such equation for each beach i: $R_i$ holds the richness observations of that beach, $X_i \beta$ is the fixed part (here a function of NAP), and $Z_i b_i$ is the random part that lets the relationship vary among beaches.
• Assumptions:
– $b_i \sim N(0, D)$
– $\varepsilon_i \sim N(0, \Sigma_i)$
– The $b_i$ and $\varepsilon_i$ are independent
Steps to Select the Model
(1) Start with a model where the fixed component contains all explanatory variables and as many interactions as possible: the beyond optimal model.
(2) Find the optimal structure of the random component. Use REML estimation to compare nested models with different random components.
(3) Find the optimal fixed structure, keeping the optimal random structure you already found in (2). Nested models with different fixed effects should be compared using ML.
(4) Present the final model using REML estimation.
REML vs ML
• ML: Maximum Likelihood Estimation
– Develops a probability equation for the probability of the data given the model
– Chooses the parameter values that maximize this probability
– The ML estimate of the variance is slightly biased because the fixed effects are estimated at the same time from the same data
• REML: Restricted Maximum Likelihood
– REML provides a solution to remove the bias
• When comparing two models, they need to both be fit using the same estimation procedure
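The bias is easiest to see in the simplest possible model, one with only an intercept. In this sketch on simulated data, the ML residual standard deviation should match the estimate that divides by n, while the REML value should match the usual estimate that divides by n − 1. The sample and names are assumptions for illustration.

```r
library(nlme)

set.seed(4)
y <- rnorm(20, mean = 10, sd = 3)   # simulated sample of size 20
d <- data.frame(y = y)

fit_ml   <- gls(y ~ 1, data = d, method = "ML")
fit_reml <- gls(y ~ 1, data = d, method = "REML")

sd_n   <- sqrt(sum((y - mean(y))^2) / 20)  # divides by n: biased downward
sd_nm1 <- sd(y)                            # divides by n - 1

# Residual standard deviations reported by the two fits
sigmas <- c(ML = fit_ml$sigma, REML = fit_reml$sigma)
```

With more fixed-effect parameters the gap grows, because REML accounts for each estimated coefficient in the same way that n − 1 accounts for the estimated mean here.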
Random Effects
• Random intercept model:
– If beach is the random factor, then it allows the intercept to vary among beaches (but the lines will be parallel)
> rikz$fbeach <- factor(rikz$Beach)  # assumption: fbeach is Beach coded as a factor
> Mlme1 <- lme(Richness ~ NAP, random = ~1 | fbeach, data=rikz)  # the "1" represents the intercept
> summary(Mlme1)
• Random intercept and slope model:
– Allows the slope and intercept to vary among beaches
> Mlme2 <- lme(Richness ~ NAP, random = ~1 + NAP | fbeach, data=rikz)
> summary(Mlme2)
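The two random structures can be compared on simulated data where the truth is known. This sketch assumes a hypothetical design (12 beaches, 15 sites each, larger than RIKZ so the fit is stable) in which both intercepts and slopes really do vary among beaches; all names and values are illustrative.

```r
library(nlme)

set.seed(5)
n_beach <- 12; n_site <- 15                   # hypothetical design, not RIKZ
beach <- factor(rep(seq_len(n_beach), each = n_site))
nap   <- runif(n_beach * n_site, -1.3, 2)
a     <- rnorm(n_beach, sd = 2)[beach]        # random intercept deviations
b     <- rnorm(n_beach, sd = 1)[beach]        # random slope deviations
rich  <- (6 + a) + (-2.5 + b) * nap + rnorm(length(nap), sd = 1)
sim   <- data.frame(Richness = rich, NAP = nap, fbeach = beach)

# Random intercept: parallel lines, one per beach
m_int   <- lme(Richness ~ NAP, random = ~1 | fbeach, data = sim)
# Random intercept and slope: each beach gets its own line
m_slope <- lme(Richness ~ NAP, random = ~1 + NAP | fbeach, data = sim)

fe <- fixef(m_slope)         # population-level intercept and slope
re <- ranef(m_slope)         # beach-specific deviations from them
anova(m_int, m_slope)        # both fit by REML with the same fixed part
```

Because the simulated slopes differ among beaches, the random intercept and slope model should be clearly preferred here; with real data the comparison can go either way.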
Random Intercepts
[Figure: Richness against NAP with observations labeled by beach number (1 to 9); a thick line shows the population-level fit and a thin parallel line shows the fit for each beach]
How to produce this graph
# Graph of random intercepts
F0 <- fitted(Mlme1, level=0)   # population-level fitted values
F1 <- fitted(Mlme1, level=1)   # beach-level fitted values
I <- order(rikz$NAP)
NAPs <- sort(rikz$NAP)
plot(NAPs, F0[I], lwd=4, type="l", ylim=c(0,22), ylab="Richness", xlab="NAP")
for(i in 1:9) {
  x1 <- rikz$NAP[rikz$Beach == i]
  y1 <- F1[rikz$Beach == i]
  K <- order(x1)
  lines(sort(x1), y1[K])
}
text(rikz$NAP, rikz$Richness, rikz$Beach, cex=0.9)
Random Intercepts and Slopes
[Figure: Richness against NAP with observations labeled by beach number (1 to 9); fitted lines for each beach differ in both intercept and slope]
Example 2: Putting it All Together
• We are going to follow the Zuur et al. 10 step plan (similar to the 4-step plan from earlier but with more steps)
• Sample data
– Begging behavior of nestling barn owls
– Response: “sibling negotiation”
• Number of calls in the 15 minutes preceding parent arrival divided by the number of nestlings
– Explanatory variables
• Sex of the parent, food treatment (food satiated or food deprived), arrival time of parent
Step 1: Linear Regression
> M.lm <- lm(NegPerChick ~ SexParent * FoodTreatment + SexParent*ArrivalTime, data=owls)
> plot(M.lm)
Step 1
• Evidence for heterogeneity
• Plot each explanatory variable against residuals to look for a pattern
• No pattern, so we can’t simply model the heterogeneity with a variance covariate
• Instead, log transform the response in this case: log10(Y+1)
• Refit the linear model on the log-transformed response and draw boxplots of the residuals by nest
Residuals After Transformation
[Figure: boxplots of standardized residuals for each of the 27 nests, with a horizontal line at zero]
Code for Previous Slide
# Log transform
> owls$LogNeg <- log10(owls$NegPerChick+1)
> M2.lm <- lm(LogNeg ~ SexParent * FoodTreatment + SexParent * ArrivalTime, data=owls)
> E <- rstandard(M2.lm)
> boxplot(E ~ Nest, data=owls, axes=FALSE, ylim=c(-3,3))
> abline(0,0); axis(2)
> text(1:27, -2.5, levels(owls$Nest), cex=0.75, srt=65)
Step 2: Fit the model with GLS
• > library(nlme)
• > Form <- formula(LogNeg ~ SexParent*FoodTreatment + SexParent*ArrivalTime)
• > M.gls <- gls(Form, data=owls)
This model will give the same results as lm, but the model is fit with REML so that it can be compared to the mixed models
Step 3: Choose a Variance Structure
• Nest looked like it introduced variation into the analysis, because some were above and some below the line
• Thus, we want to include nest as a random factor
Step 4: Fit the Model
• > M1.lme <- lme(Form, random = ~ 1 | Nest, method = "REML", data=owls)
• Form is the same formula as before
• We tell lme to use “REML” so it will be comparable to the gls model
Step 5: Compare New Model with Old Model
• > anova(M.gls, M1.lme)
• The models were both fit with REML and are nested, so the Likelihood Ratio Test is valid
         Model  df  AIC        BIC        logLik      Test     L.Ratio    p-value
M.gls    1      7   64.37422   95.07058   -25.18711
M1.lme   2      8   37.71547   72.79702   -10.85773   1 vs 2   28.65875   <.0001
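The likelihood ratio in this table follows directly from the two log-likelihoods. The sketch below only re-uses the printed values; the variable names are illustrative.

```r
# Values copied from the anova(M.gls, M1.lme) output
loglik_gls <- -25.18711   # M.gls, 7 parameters
loglik_lme <- -10.85773   # M1.lme, 8 parameters

# Likelihood ratio statistic: twice the difference in log-likelihood
lr <- 2 * (loglik_lme - loglik_gls)

# One extra parameter (the random intercept variance), so 1 degree of freedom
p_value <- pchisq(lr, df = 8 - 7, lower.tail = FALSE)
```

The result agrees with the reported L.Ratio up to rounding of the printed log-likelihoods, and the p-value is far below 0.0001.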
Step 6: Validate the Model so Far
[Figure: four panels of residuals against fitted values, sex of parent (Female, Male), food treatment (Deprived, Satiated), and arrival time in hours]
Code for the Previous Slide
# Step 6 Model Validation
E2 <- resid(M1.lme, type="normalized")
F2 <- fitted(M1.lme)
op <- par(mfrow = c(2,2), mar=c(4,4,3,2))
MyYlab <- "Residuals"
plot(x=F2, y=E2, xlab="Fitted values", ylab=MyYlab)
boxplot(E2 ~ SexParent, data=owls, main="Sex of parent", ylab=MyYlab)
boxplot(E2 ~ FoodTreatment, data=owls, main="Food treatment", ylab=MyYlab)
plot(x=owls$ArrivalTime, y=E2, ylab=MyYlab, main="Arrival time", xlab="Time (hours)")
par(op)
Steps 7 and 8: The Optimal Fixed Structure
• > summary(M1.lme) #NOT anova()
Fixed effects: list(Form)
                           Value        Std.Error    DF    t-value     p-value
(Intercept)                 1.1236414   0.19522087   567    5.755744   0.0000
SexParentMale               0.1082138   0.25456854   567    0.425087   0.6709
FoodTreatmentSatiated      -0.1818952   0.03062840   567   -5.938776   0.0000
ArrivalTime                -0.0290079   0.00781832   567   -3.710251   0.0002
SexParentMale:FoodTrtSat    0.0140178   0.03971071   567    0.352998   0.7242
SexParentMale:ArrivalTm    -0.0038358   0.01019764   567   -0.376144   0.7070
Consider dropping the least significant interaction term and refitting the model. Sequentially do this until only significant interactions remain.
Step 7 and 8
• Best way to perform this step is probably via the likelihood ratio test
• Refit the model using “ML” but keep the same random structure
• Compare models with and without various terms until all terms are significant
Steps 7 and 8
# Use ML to refit the model and compare reduced models using LR tests
M1.Full <- lme(Form, random = ~ 1 | Nest, method = "ML", data=owls)
M1.A <- update(M1.Full, .~. -SexParent:FoodTreatment)
M1.B <- update(M1.Full, .~. -SexParent:ArrivalTime)
anova(M1.Full, M1.A)
anova(M1.Full, M1.B)
> anova(M1.Full, M1.A)
          Model  df   AIC     BIC    logLik   Test     L.Ratio    p-value
M1.Full   1      8    -0.74   34.4   8.374
M1.A      2      7    -2.62   28.1   8.312    1 vs 2   0.123736   0.725
> anova(M1.Full, M1.B)
p-value = 0.7102
M1.A had the highest p-value (of these two tests), so we drop the SexParent:FoodTreatment interaction first. Repeat until only significant effects are left – but remember that if an interaction is significant, you must keep the associated main effects
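The drop-and-compare loop can be sketched on simulated data in which only one effect is real. Everything below is an assumption made for illustration (20 nests, a food effect of −0.5, no sex effect, no interaction); it is not the owl data, and the object names are hypothetical.

```r
library(nlme)

set.seed(6)
nest <- factor(rep(1:20, each = 15))
sex  <- factor(sample(c("Female", "Male"), 300, replace = TRUE))
food <- factor(sample(c("Deprived", "Satiated"), 300, replace = TRUE))
u    <- rnorm(20, sd = 0.3)[nest]                 # nest random intercepts
# Only food has a real effect; there is no sex effect and no interaction
y    <- 1 - 0.5 * (food == "Satiated") + u + rnorm(300, sd = 0.4)
sim  <- data.frame(y, sex, food, nest)

# Fit with ML because the models differ in their fixed effects
full    <- lme(y ~ sex * food, random = ~1 | nest, method = "ML", data = sim)
no_int  <- update(full, . ~ . - sex:food)         # drop the interaction
no_food <- update(no_int, . ~ . - food)           # drop a real main effect

# Likelihood ratio tests; the p-value sits in the second row of each table
p_int  <- anova(full, no_int)$"p-value"[2]
p_food <- anova(no_int, no_food)$"p-value"[2]
```

Dropping the real food effect should give a very small p-value, which signals that the term must stay. The random structure is kept identical across all three fits, as the steps require.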
Step 9: Refit the reduced model with REML
• M5 <- lme(LogNeg ~ FoodTreatment + ArrivalTime, random = ~ 1 | Nest, method = "REML", data=owls)
• summary(M5)
Step 9 output
Linear mixed-effects model fit by REML
Fixed effects: LogNeg ~ FoodTreatment + ArrivalTime
                 Value        Std.Error    DF    t-value     p-value
(Intercept)       1.1821386   0.12897491   570    9.165648   0
FoodTreatment    -0.1750754   0.01996606   570   -8.768650   0
ArrivalTime      -0.0310214   0.00511232   570   -6.067954   0
 Correlation:
                        (Intr)   FdTrtS
FoodTreatmentSatiated   -0.112
ArrivalTime             -0.984    0.039

Standardized Within-Group Residuals:
        Min           Q1          Med           Q3          Max
-2.22283609  -0.78307304  -0.07461892   0.68690000   3.29183331
Step 10
• Model validation and interpretation
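As one possible aid to interpretation, the Step 9 coefficients can be moved back from the log10 scale. This is a sketch under stated assumptions: the arrival time of 24 hours is a hypothetical value chosen from the range shown in the validation plots, the nest random effect is set to zero, and back-transformed values are not means on the original scale.

```r
# Coefficients copied from the Step 9 output (log10 scale)
b_intercept <- 1.1821386
b_food      <- -0.1750754   # food treatment coefficient (satiated vs. deprived)
b_arrival   <- -0.0310214   # change per hour of arrival time

# The response is log10(NegPerChick + 1), so effects are multiplicative
# on (NegPerChick + 1)
ratio_food <- 10^b_food

# Back-transformed predictions at a hypothetical arrival time of 24 h,
# with the nest random effect set to zero
arrival <- 24
pred_deprived <- 10^(b_intercept + b_arrival * arrival) - 1
pred_satiated <- 10^(b_intercept + b_food + b_arrival * arrival) - 1
```

Under these assumptions, food-satiated broods have a value of NegPerChick + 1 that is roughly two thirds of that for food-deprived broods at the same arrival time.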