
Source document: hem.bredband.net/didrik71/mstat/M5_Linear_models_2nov.doc

Didrik Vanhoenacker, 2009

Mission 5. Linear models. (2 November)

Learning goals

Be able to state hypotheses about the effects of your explanatory variables.
Be able to formalize these hypotheses into formal linear models.
Be able to make and interpret graphs with two explanatory variables.
Be able to check model assumptions.
Be able to test your data with an anova table and to correctly interpret such a table.
Be able to test your specific hypotheses with model comparisons.

Data collection
Use the data you collected on Friday.

Mission
Evaluate your Friday study and write a Mini-report with neat graphs and conclusions. First use only two explanatory variables, 1 continuous and 1 categorical! (That is, if you have more.)

Hand in
One Mini-report on your Friday study.

Mail it to me, [email protected], and to your feedback pal.

New feedback pals will be: Send to the one your arrow is pointing at.

Niklas → Shayer → Esther → Wang → Joana → Yao → Nizar → Karin J → Charles → Peter → Yong Nok Li → Rushana → Malin → Tahir → Niklas


What shall I do with the Mini-report I get as a feedback pal?

Get inspired!
– Oh, shit! Is it possible to pimp the graph in gold? How did you do that?

Get enlightened!
– Aha! Maybe I should have made similar conclusions myself?

Give feedback!
– Really neat report (especially the gold)! But is your conclusion really tight?

– We've got a full sheet of data, a half pack of manuals, it's dark, and we're wearing sunglasses.

– Hit it!

Decide which 2 explanatory variables you want to work with (1 cont + 1 cat)
If you have more, that is! You can analyse the others later, but do start with two. Choose the two based on, e.g.:

balance (a tree species variable will be hard to analyse if it contains 57 oaks and 2 maples…)
variation (tree circumference will be a useless variable if all trees have a circumference of 126 cm, except for one with 12 cm…)
an interesting hypothesis
sampling accuracy (if you think that one of your variables suffers from crappy measurements, you might find it less interesting to test…)

N.B. The criteria above are meant to make this exercise as meaningful and instructive as possible. They are NOT necessarily good criteria for choosing variables in a "real" study.

Remove rare groups?
You can use three groups (e.g., three tree species) in a categorical variable; this is handled by the anova tools. However, if one tree species was found only four times or fewer, you probably want to remove that tree species (in Excel).
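If you prefer to do this in R instead of Excel, something like the following works (a sketch; "mydata", "species", and "maple" are placeholder names, not from your data set):

```r
# Drop a hypothetical rare species and forget its now-empty factor level
mydata <- subset(mydata, species != "maple")
mydata$species <- factor(mydata$species)  # re-create the factor without unused levels
table(mydata$species)                     # check the remaining group sizes
```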

Make a simple hypothesis graph
By hand, on paper, make a graph of how you think your 2 chosen explanatory variables will affect your response variable.

Then try to formulate these hypotheses in words: "I expect Frog weights to … ".


Open and attach your data set in R
To be sure you have done this correctly this time, check your variables.

names(yourdataset)

Your continuous variables should be numeric, while your categorical variables should be factors.

is.numeric(yourcontinuous)
is.factor(yourcategorical)

Test this with all your variables. If you get FALSE, you have most likely opened your data in the wrong way. Then try again…
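If a continuous variable comes out as FALSE, one common cause is a stray text entry (e.g. a comma as decimal separator) that made R read the whole column as a factor. A sketch of how to coerce it back, with placeholder names:

```r
# "mydata" and "circumference" are placeholders;
# values that cannot be converted become NA (with a warning)
mydata$circumference <- as.numeric(as.character(mydata$circumference))
is.numeric(mydata$circumference)  # should now be TRUE
summary(mydata$circumference)     # look for unexpected NAs
```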

To check that your categorical variables contain all the groups they should, and only those groups, use:

levels(yourcategorical)

If you get more groups than you thought you had, some of the group names are probably misspelled. If so, correct them in Excel and reopen in R.

When you copy things around in Word, and type new things, Word may be evil! Sometimes it changes your quotation marks from straight to "smart" (= typographic) ones. R wants straight quotation marks "" and will give an error message with typographic ones. Stop Word from messing with your code: Tools (= Verktyg) → AutoCorrect Options (= Alternativ för autokorrigering). Under the tab AutoFormat As You Type (= Autoformat vid inskrivning), uncheck "Straight quotation marks with smart quotes" (= "Raka citattecken med Typografiska").

Be careful with spelling and parentheses!

Evaluate your data graphically
Make an ancova-type graph. Check the Pimp your graph manual!

Does your graph support your hypotheses?

E.g., is it similar to your hand made hypotheses graph?

What patterns do you think there are in your data?

How confident are you that these patterns really reflect true relationships or differences?
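If you don't have the Pimp your graph manual at hand, a minimal ancova-type graph in base R could look like this (a sketch with placeholder names "mydata", "y", "xcont", "xcat"; it colours points by group and draws one regression line per group):

```r
# Points coloured by group
plot(y ~ xcont, data = mydata, col = as.numeric(mydata$xcat), pch = 16)
# One regression line per group, in the matching colour
for (g in seq_along(levels(mydata$xcat))) {
  sub <- mydata[mydata$xcat == levels(mydata$xcat)[g], ]
  abline(lm(y ~ xcont, data = sub), col = g)
}
legend("topleft", legend = levels(mydata$xcat),
       col = seq_along(levels(mydata$xcat)), pch = 16)
```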


library(car)
You need the package car. If you haven't installed that package you must first do so, in the same way as you installed gdata or xlsReadWrite for reading Excel files. Check the Open and Save in R manual if you don't remember. If you don't know whether it is installed, try to load it: if it is NOT installed, R will tell you "there is no package called 'car'". When car is installed you load it with:

library(car)

Test if the model assumptions for parametric tests are met
First specify the full model, i.e., the most complicated model. In this case that means a model where your response variable depends on your categorical explanatory variable, your continuous explanatory variable, and their interaction. Like this (but of course change the y and x:es to your variables):

mod.full<-lm(y~xcont+xcat+xcont:xcat)

Now you can extract all the residuals and put them in a box:

myresiduals<-resid(mod.full)

To see whether your residuals are normally distributed, take a look at a histogram of them.

hist(myresiduals)

This histogram should look "normal".

To see whether the variation is evenly distributed, plot the residuals against the fitted values. The fitted values are the values your response variable would have if all its variation were explained by your explanatory variables, that is, the values on the regression line or at the group means.

plot(myresiduals~fitted(mod.full))

The points in this graph should be equally dispersed along the fitted values. Be especially aware of a funnel shape, where the variation increases with the fitted values. This is typical for multiplicative effects (percentage effects), and you may consider testing a logged relationship instead.
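One way to try a logged relationship is simply to log the response and re-check the assumption plots (a sketch; this assumes all your y values are positive):

```r
mod.log <- lm(log(y) ~ xcont + xcat + xcont:xcat)
hist(resid(mod.log))                    # re-check normality
plot(resid(mod.log) ~ fitted(mod.log))  # re-check that the funnel shape is gone
```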


The default R assumption plot is plot(model). Click on the graph window to get the four graphs.

plot(mod.full)

The first two are the most important (= the same as you've done above); the others check for skewness and for how much influence individual data points have on the model. You will not get a histogram but a qq-plot: if your residuals are normally distributed they will form a straight line.

STOP! I really recommend inspecting assumption plots instead of using assumption tests. But if you really want to test, here is some advice.

To test for normality the formal way, use for example the Shapiro–Wilk test, but note that it is often considered overly sensitive, especially at large sample sizes, where it flags even trivial deviations from normality.

shapiro.test(myresiduals)

You may also test for heteroscedasticity (= non-homogeneous variances) using a Non Constant Variance test (= Breusch-Pagan test) in the package car.

library(car)
ncvTest(mod.full)

If you only have categorical explanatory variables you may also test for heteroscedasticity (= non-homogeneous variances) using a var.test (two groups only) or a bartlett.test. According to Crawley and the R help files however, the Fligner-Killeen test is probably better.

fligner.test(y~xcat)

Test your data with an Anova table
First specify the full model, i.e., the most complicated model (the same mod.full as above). In this case that means a model where your response variable depends on your categorical explanatory variable, your continuous explanatory variable, and their interaction. Like this (but of course change the y and x:es to your variables):

mod.full<-lm(y~xcont+xcat+xcont:xcat)

Then you use the Anova table function (in the car package) to test all variable effects. Note the capital A in Anova: if you use the lower-case anova you will often get a slightly different result. I will talk about why tomorrow. But you would almost always want to use Anova, not anova.

library(car)
Anova(mod.full)


Now examine the Anova table. Always interpret an Anova table from the bottom and level-wise upwards!

In this case, with two main effects (variables) and their interaction, it means:

o Start with the interaction. Is the interaction significant?
o If yes, then STOP; this is the result! The effect of one explanatory variable depends on the other explanatory variable! Go back to your graph and check how you best interpret that.
o If no, it means that the interaction was not important. You can continue upwards.
o Now check both main effects. They are on the same interaction level (in this case no interaction = main effects), so they are tested "at the same time" and should be evaluated at the same time. Are the main effects significant?
o If yes, then one or both of your explanatory variables did affect your response! Go back to your graph and check how you best interpret that.
o If no, then you were not able to nail down what explains the variation in your response.

Try a larger sample size next time (or other, better explanatory variables!).
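If you want to practise reading the table before trying your own data, here is a toy simulated example (made-up data, not from the Friday study) where the interaction is built in on purpose:

```r
set.seed(1)
xcont <- runif(60)                                  # a continuous predictor
xcat  <- factor(rep(c("birch", "oak"), each = 30))  # a categorical predictor
# oaks get a steeper slope, so the xcont:xcat row should be significant
y <- 2 + 3 * xcont + (xcat == "oak") * 4 * xcont + rnorm(60)
mod.full <- lm(y ~ xcont + xcat + xcont:xcat)
library(car)
Anova(mod.full)  # read from the bottom: look at the xcont:xcat row first
```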

Test your data with model comparisons – Specify the models
First specify the models you want to compare. In this case, with two explanatory variables, your models will be something like below. Check the powerpoint from the Friday lecture if you'd like. Your analyses today will be very similar to the one on how area and county affected species richness in grassland sites.

mod.int<-lm(y~xcont+xcat+xcont:xcat)
mod.both<-lm(y~xcont+xcat)
mod.xcont<-lm(y~xcont)
mod.xcat<-lm(y~xcat)
mod.null<-lm(y~1)

The colon (:) specifies an interaction. The y~1 specifies a null model, i.e., we only use the total mean to explain the y values.

Compare your models
There are some rules for this:

You test what has been added.

anova(mod.xcont,mod.both)

tests xcat, because xcat is what has been added in mod.both compared to mod.xcont. Think like this: If you have information on only your continuous variable (= mod.xcont) you will have some amount of residual variation (remember: the sum of the squared residual lines). How much is this residual variation reduced if you add information about your categorical variable (mod.xcont + xcat = mod.both)? Is the reduction significant? That is, are we significantly better at explaining the response variable (the y values) if we know both xcont and xcat compared to knowing only xcont?

Russian doll principle. You can only compare two models where one is the same as the other except that one thing has been added. The larger, more complicated model must contain the smaller one.

Therefore you can NOT compare mod.xcont with mod.xcat. One does NOT contain the other! They are not Russian dolls…

Start with the most complicated model, and reduce level-wise as much as possible.
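With the model names from above, the level-wise sequence of comparisons could look like this (a sketch; each anova() call tests exactly the term that was added):

```r
anova(mod.both, mod.int)    # tests the interaction xcont:xcat
anova(mod.xcont, mod.both)  # tests xcat, given xcont
anova(mod.xcat, mod.both)   # tests xcont, given xcat
anova(mod.null, mod.xcont)  # tests xcont alone
```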

Warning 1. Is there a problem with correlated explanatory variables?
If you suspect that this is the case, plot the explanatory variables against each other. If it is two continuous explanatory variables you think are correlated, just do a regression graph. But categorical explanatory variables may also be correlated with either a continuous explanatory variable or another categorical explanatory variable. Check this with stripcharts or barplots, respectively. E.g., if all oaks are 250–350 cm in circumference, while all birches are 80–140 cm in circumference, it will be impossible to say whether it is tree species or tree size that affects e.g. lichen size.

The formal way to test for collinearity (= that your explanatory variables are correlated) is to check that the "variance inflation factor" (= VIF) is below 10. This test function is also in the package car, so you have to load that if you haven't already. Use a model with all explanatory variables, but without interactions.

library(car)
vif(mod.both)

You will get vif values for all your explanatory variables. If some of them are above 10, try removing the variable with the highest vif value and test again.

Warning 2. Centre continuous variables if you want to get the estimated effect of both their main effects and their interaction!
[ This is advanced! ] Example scenario: Both soil nitrogen and soil moisture have an effect on plant growth, and the effect of soil nitrogen depends on soil moisture. So the effect of nitrogen will be different in very moist soils compared to drier soils. In this case the default in a multiple regression analysis is to test the main effect of nitrogen at the intercept of soil moisture, i.e., when moisture is zero. However, it is usually more interesting to know the main effect of soil nitrogen at the mean level of soil moisture. If soil moisture is a normally distributed variable, this is also where most soil samples will be! To achieve that you can centre your variables by subtracting their means:

c.moisture<-moisture-mean(moisture)
c.nitrogen<-nitrogen-mean(nitrogen)

Now the means will be zero, which means that the intercept will occur at the mean. And that, in turn, means that the main effects will be evaluated at the means.
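The same centring can be done with scale(), and the centred variables then go into the model as usual (a sketch; "growth" is a placeholder response variable):

```r
# scale(..., scale = FALSE) only subtracts the mean;
# as.numeric drops the matrix class that scale() returns
c.moisture <- as.numeric(scale(moisture, scale = FALSE))
c.nitrogen <- as.numeric(scale(nitrogen, scale = FALSE))
mod.c <- lm(growth ~ c.nitrogen + c.moisture + c.nitrogen:c.moisture)
```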
