ANALYSIS OF VARIANCE AND
MODEL FITTING FOR R
C. Patrick Doncaster
http://www.soton.ac.uk/~cpd/
CONTENTS Page
Lecture: One-Way Analysis of Variance .......................................................................... 1
Comparison of parametric and non-parametric methods of analysing variance
What is parametric one-way Analysis of Variance (ANOVA)?
How to do a parametric one way Analysis of Variance
What are degrees of freedom?
Assumptions of parametric Analysis of Variance
Summary of parameters for estimating the population mean
Practical: Calculating One-Way Analysis of Variance ..................................................... 9
Lecture: Two-Way Analysis of Variance ........................................................................ 15
Example of two-way Analysis of Variance: cross-factored design
Using a statistical model to define the test hypothesis
Degrees of freedom
How to do a two-way Analysis of Variance
Using interaction plots
Lecture: Regression .......................................................................................................... 23
Comparison of Analysis of Variance and regression models
Degrees of freedom for regression
Calculation of the slope and intercept of the regression line
Practical: Two-Way Analysis of Variance in R ............................................................... 29
Lecture: Correlation and Transformations .................................................................... 31
The difference between correlation and regression, and testing for correlation
Transforming data to meet the assumptions of parametric Analysis of Variance
Lecture: Fitting Statistical Models to Data ..................................................................... 37
The three principal types of data and statistical models
1. One sample, one variable: G-test of goodness-of-fit
2. One sample, two variables:
(a) Categorical variables: G-test of contingency table
(b) Continuous variables: regression or correlation
3. One-way classification of two or more samples: Analysis of Variance
Supplementary information: Selecting and fitting models
1. One-way classification with two continuous variables: multiple regression
2. Two-way classification of samples: two-factor ANOVA or General Linear Model
Practical: Calculating Regression and Correlation ......................................................... 43
Appendix 1: Terminology of Analysis of Variance .............................................................. 45
Appendix 2: Self-test questions (1)......................................................................................... 49
Appendix 3: Sources of worked examples - ANOVA ........................................................... 51
Appendix 4: Procedural steps for Analysis of Variance ...................................................... 53
Appendix 5: Self-test questions (2)......................................................................................... 55
Appendix 6: Sources of worked examples - Regression ....................................................... 57
Appendix 7: Table of critical values of the F-distribution .................................................. 59
Lecture notes: One-way Analysis of Variance
C. P. Doncaster 1
LECTURE: ONE-WAY ANALYSIS OF VARIANCE
This booklet covers five lectures and three practicals. It is designed to help you:
1. Understand the principles and practice of Analysis of Variance, regression and correlation;
2. Appreciate their underlying assumptions, and how to meet them;
3. Learn the basics of using statistical models for quantitative solutions.
In meeting these objectives you will also become more familiar with the terminology of
parametric statistics, and this should help you use statistical packages and interpret their output,
and better understand published analyses.
Comparison of parametric and non-parametric methods
You have already been introduced to non-parametric tests earlier in this course. These are useful
because they tend to be robust - they give you a rough but reliable estimate and work well on data
which have an unknown underlying distribution. But often we can be confident about underlying
distributions, and then parametric statistics begin to show their strengths.
Some limitations of non-parametric statistics:
1. They test hypotheses, but do not always give estimates for parameters of interest;
2. They cannot test two-way interactions, or categorical combined with continuous effects;
3. They each work in different ways, with their own quirks and foibles and no grand scheme;
4. In situations of even moderate complexity such as you may encounter when doing research
projects, there may be no non-parametric statistic readily available.
Some advantages of parametric statistics:
1. They can be more powerful because they make use of actual data rather than ranks;
2. Parametric tests are very flexible, coping well with incomplete data and correlated effects;
3. They can test two-way interactions, and also categorical combined with continuous effects;
4. They are all built around a single theme, of Analysis of Variance. So there is a grand scheme,
a single framework for understanding and using them.
What is Analysis of Variance (ANOVA)?
Analysis of Variance is an extension of the Student’s t-test that you will already be familiar with.
A t-test can look for differences between the mean scores in two samples (e.g. body weights of
males and females). A one-way Analysis of Variance can look for an overall difference between
the mean scores in 2 or more samples of a factor (e.g. crop yield under three different treatments
of fertiliser). Later we will see how a two-way Analysis of Variance can further partition the
variance among two factors (e.g. crop yield under different combinations of pesticide as well as
fertiliser).
What does Analysis of Variance do? It analyses samples to test for evidence of a difference
between means in the sampled population. It does this by measuring the variation in a continuous
response variable (e.g. weight, yield etc) in terms of its sum of squared deviations from the
sample means. It then partitions this variation into explained and unexplained (residual)
components. Finally it compares these partitions to ask how many times more variation is
explained by differences between samples than by differences within samples.
Most ways of measuring variation would not allow partitioning, because the variation in the
components would not add up to the variation in the whole. We use ‘sums of squares’ because
they do have this property. We get the explained component of variation from the sum of squared
deviations of sample means from the global mean. Then we get the unexplained component of
variation from the sum of squared deviations of variates from their sample means. These two
components together account for the total variation, which can be obtained from the sum of
squared deviations of variates from the global mean.
Let’s see how it works in practice. Say we have sampled a woodland population of wood mice,
and found the average weight of adult males is 25 g, and the average of adult females (not
gestating) is 17 g. But both sexes vary quite widely around these means, and some males are
lighter than some females. We want to know whether our samples just reflect random variation
within an undifferentiated population, or whether they illustrate a real difference in weight by
sex.
The problem is illustrated below with an ‘interval plot’ produced by R. It shows male and female
means and their 95% confidence intervals. This is a common way of summarising averages of a
continuous variable. The vertical lines cover the range of possible values for each population
mean, with 95% confidence. You will see how they are derived in the practical, but we use them
here to illustrate the extent of variation within each sample.

[Fig. 1: interval plot of mean body weight for female and male wood mice, with vertical lines spanning the 95% confidence interval of each mean.]
The confidence intervals overlap, reflecting the fact that some females were heavier than some
males. We do an Analysis of Variance to test whether the sexes are really likely to differ from
each other on average in the population, despite this overlap in the samples. This involves
comparing the two sources of variation in weight: (i) the average variation between means for
each sex (this is the variation explained by the factor ‘Sex’), and (ii) the average variation around
each sample mean (this is the residual, unexplained variation). Together they add up to the total
variation, when variation is measured as squared deviations from means.
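This partition is easy to sketch in R. The weights below are invented for illustration; only their sample means (25 g for males, 17 g for females) match the text:

```r
# Invented wood-mouse weights (g): sample means match the text
# (males 25 g, females 17 g), but the individual values are made up
male   <- c(28, 22, 24, 27, 24)
female <- c(15, 19, 16, 18, 17)
weight <- c(male, female)
sex    <- rep(c("M", "F"), each = 5)

G    <- mean(weight)      # grand mean: 21
Ybar <- ave(weight, sex)  # each score's sample mean (25 or 17)

ss.explained <- sum((Ybar - G)^2)       # between sexes: 160
ss.residual  <- sum((weight - Ybar)^2)  # within sexes: 34
ss.total     <- sum((weight - G)^2)     # total: 194

# The partition holds: explained + residual = total
isTRUE(all.equal(ss.explained + ss.residual, ss.total))
```

Note that 160 + 34 = 194: sums of squares are the one measure of variation for which the components add up to the whole.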
Box 1. Partitioning the sums of squares (supplementary information)
Why do explained and unexplained sources of variation add up to the total variation, when
variation is measured as squared deviations from means?
For any one score, Y - G is its deviation from the grand mean G. If we measure variation as squared deviations, then the total variation in our two samples is the sum of squares: Σ(Y - G)².
However, each Y - G comprises two components: Y - Ȳ is the deviation of the score from the mean Ȳ for its sample i, and therefore the component not explained by the factor ‘sex’; Ȳ - G is the deviation of the sample mean from the grand mean, and therefore the explained component. For example, a score of 28 g for a particular male is 3 g away from his sample mean Ȳ = 25 g, which compares to the deviation of 4 g by which the sample mean differs from the global mean G = 21 g (i.e. the mean of the means for each sex: (25 + 17)/2).
We can use a vector to describe the deviation of each score in terms of the two independent
sources of variation (explained and unexplained).
We plot the deviation of any one score from its sample mean Ȳ on an axis perpendicular to the one describing the deviation of the sample mean Ȳ from the global mean G. This is because these two deviations are independent by definition: the horizontal component in the graph is explained by the factor sex, and the vertical component is unexplained, residual deviation. The total deviation is then the resultant vector, i.e. the bold arrow in the graph below resulting from the combination of these two independent sources of variation.
[Figure: right-angled vector diagram. The horizontal arrow, Ȳ - G, is the explained component; the vertical arrow, Y - Ȳ, is the unexplained (error) component; the bold resultant, Y - G, is the total deviation of the score.]
The squared length of this vector equals the sum of the squares of the other two sides (vertical and horizontal arrows: Pythagoras’s theorem). So if we represent variation as squared deviations, the variation for each score partitions into the two independent sources: the explained (Ȳ - G)² and the unexplained (Y - Ȳ)². We could attach such vectors to all our scores, and the sum of all these increments then gives the total squared deviations in terms of the explained variation added to the unexplained variation: Σ(Y - G)² = Σ(Ȳ - G)² + Σ(Y - Ȳ)².
If the average squared deviation of Ȳ from G is big compared to the average squared deviation of Y from Ȳ, then we can conclude that most of the total variation is explained by differences between the sample means. This is exactly the procedure adopted by Analysis of Variance.
How to do a one-way Analysis of Variance
Let’s do this very simple Analysis of Variance on the two samples of adult wood mice. We want
to know if there is any difference between the body weights of males and females that cannot be
attributed to sampling error.
Design: Firstly it is very important to have designed a method of data collection that will allow a
sample to represent the population that we are interested in. Whatever the method, it must allow
subjects to be picked at random from the population. So if our male sample is going to comprise
5 individuals, they should not all be brothers, or all taken from the same patch of wood. [In the
practical you will look at an experimental analysis, of the effect of different pesticides on
hoverflies; you will then have experimental plots in place of individuals, and the important
design consideration will be to allocate the different treatments (of pesticide) at random to the
experimental plots.]
Analysis: Having collected our samples, we then weigh all the males and all the females, and
calculate mean weights for each sample, and a grand (i.e. total or pooled) mean weight. These
data have been put into a spreadsheet, which is shown in Fig. 2 below. They will allow us to test
the null hypothesis, H0: There is no difference between the sample means.
Fig. 2. Data on body weights of male and female wood mice, as they look in an Excel spreadsheet.
Each score can now be tagged with the following information:
1. Its sample mean (column D);
2. The grand mean (col E);
3. The squared deviation of the sample mean from the grand mean (col F), which equals the
component of variation for this score that is explained by the independent variable ‘sex’;
4. The squared deviation of the score from the sample mean (col G), which equals the
component of unexplained variation for this score;
5. The squared deviation of the score from the grand mean (col H), which equals the component
of total variation.
Columns F, G and H are then summed to find their ‘Sums of Squares’, which define the variation from explained and unexplained sources, and the total variation:

SS(explained) = Σ(Ȳ - G)²,  SS(error) = Σ(Y - Ȳ)²,  SS(total) = Σ(Y - G)²

We are interested in comparing the average explained variation with the average unexplained (error) variation, and we get these averages from the ‘Mean Squares’:

MS(explained) = SS(explained)/(a - 1),  MS(error) = SS(error)/(n - a)

These Mean Squares measure the explained and unexplained variances in terms of variability per degree of freedom. Finally, the F-statistic is obtained from the ratio of these two Mean Squares:

F = MS(explained)/MS(error) = 7.27 for the wood-mouse data.
Interpretation: The F statistic is the ratio of average explained variation to average unexplained
variation, and a large ratio indicates that differences between the sample means account for much
of the variation of scores from the grand mean score. We can look up a level of significance in
tables of the F-statistic. In this example, for 1 and 8 degrees of freedom, the critical 5% value is
5.32. Since our calculated value exceeds this, we can draw the following conclusion: “Body
weights differ between males and females in the sampled population (F1,8 = 7.27, p < 0.05)”.
This is the standard way to present results of Analysis of Variance. Whenever presenting
statistical results, always give the degrees of freedom that were available for the test, so the
reader can know how big your samples were. For any Analysis of Variance this means giving
two sets of degrees of freedom.
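In R the whole calculation is done by aov(). The weights below are hypothetical stand-ins for the Fig. 2 spreadsheet (so the F value differs from the 7.27 quoted above), but the layout of the output is exactly what you will meet in the practicals:

```r
# Hypothetical stand-in data for the Fig. 2 spreadsheet
weight <- c(28, 22, 24, 27, 24, 15, 19, 16, 18, 17)
sex    <- factor(rep(c("M", "F"), each = 5))

fit <- aov(weight ~ sex)
summary(fit)  # table of Df, Sum Sq, Mean Sq, F value and Pr(>F)
```

The first row of the table is the explained term (‘sex’) with a - 1 = 1 d.f.; the second is the residuals with n - a = 8 d.f. The F value is the ratio of the two Mean Squares, reported as F with 1 and 8 degrees of freedom.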
What are degrees of freedom?
General rule: The F-ratio in an Analysis of Variance is always presented with two sets of degrees
of freedom. In a one-way test, the first corresponds to one less than the a samples or levels of the
explanatory variable (a - 1), and the second to the remaining error degrees of freedom (n - a).
For both sets, the degrees of freedom equal the number of bits of information that we have,
minus the number that we need in order to calculate variation. Think of degrees of freedom (d.f.)
as the numbers of pieces of information about the ‘noise’ from which an investigator wishes to
extract the ‘signal’. If you want to draw a straight line to represent a scatter of n points, you need
two pieces of information, slope and intercept, in order to define the line (i.e. you need n ≥ 2);
the scatter about the line (are all the points on it, or are they scattered or curved from it?) can then
be measured with the remaining n - 2 degrees of freedom. This is why the significance of a
regression is tested with a student’s t with n - 2 d.f. Likewise, when looking for a difference
between two samples, a Student’s t is tested with n - 2 d.f. because one d.f. is required to fix each
of the two sample means.
In Analysis of Variance, the first set of degrees of freedom refers to the explained component of
variation. This takes size a – 1, because we have a sample means and we need 1 grand mean to
calculate variation between these means. The second set of degrees of freedom refers to the
unexplained (error) variation. This takes size n – a, because we have n data points and we need a
sample means to calculate variation within samples.
Thus we calculate the average variance of sample means around the grand mean from the sum of squared deviations of Ȳ from G, divided by one less than the a samples (= 1 for the wood mice). Then we can deduce the average error variance from the sum of squared deviations of Y from Ȳ, divided by the remaining n - a degrees of freedom (= 8 in the wood mouse example).
Degrees of freedom are very important because they tell us how powerful our test is going to be.
Look at the table provided of critical values of the F-distribution (p. 59). With few error d.f. (the
rows), the error variation needs to be many times smaller than variation between groups before
the ratio of MS(explained) to MS(error) is big enough that we can be confident of a difference between
groups in the population from which we took samples for analysis.
This is particularly true when comparing between few samples. For example, if we want to
compare two samples each of 3 subjects, then the two sample means take 2 pieces of information
from the 6 subjects, leaving us with 4 error d.f. A significant difference at P < 0.05 then requires
that the average variation between samples is more than 7.71 times greater than the average
residual variation within each sample (as opposed to > 5.32 for the 2 samples of wood mice each
with 5 subjects: Appendix 7).
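Rather than reading Appendix 7, you can ask R for these critical values directly with the F-quantile function qf():

```r
# Critical 5% values of the F-distribution, as quoted in the text
qf(0.95, df1 = 1, df2 = 8)  # 5.32 for 2 samples of 5 (the wood mice)
qf(0.95, df1 = 1, df2 = 4)  # 7.71 for 2 samples of 3
```

The first argument is the cumulative probability (0.95 for a 5% significance level), and df1 and df2 are the explained and error degrees of freedom respectively.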
Assumptions of Analysis of Variance:
The Analysis of Variance is run on samples taken from a population of interest, which means it
must assume: random sampling, independent residuals, normally distributed residuals, and
homogeneous variances. We examine these four assumptions with a real example in the practical.
1. Random sampling is a design consideration for all parametric and non-parametric analyses. If
we had some a priori reason for wanting male mice to be heavier on average than females,
perhaps to bolster a favoured theory, then we might be tempted to choose larger males as
‘representatives’ of the male population. Clearly this is cheating, and only bolsters a circular
argument. Random sampling avoids this problem.
2. Independence is the assumption that the residuals (or ‘errors’, the deviations of scores
from their sample means) should be independently distributed around sample means. In other
words, knowing how much one score deviates from its sample mean should not reveal anything
about how others do. Statistics only work by the accumulation of pieces of evidence about the
population, no one of which is convincing in itself. In combining these increments it is
obviously important to know that they are independent, and you are not repeatedly drawing on
the same information in different guises. This is true for both parametric and non-parametric
tests, and it is one of the biggest problems in statistical analysis for biologists.
If the wood mouse data came from sampling a wild population, some individuals may be caught
several times (if they get released back into the population after weighing). But clearly 5
measures repeated on the same individual do not provide the same amount of information as one
measure on each of 5 different individuals. This problem is called ‘pseudo-replication’ and leads
to the degrees of freedom being unjustly inflated. Analysis of variance can be conducted on
repeated measures, but it requires declaring ‘Individual’ as a second factor, and this adds extra
complications and assumptions - avoid it if at all possible!
Equally if most males came from one locality and most females from another, then we may be
seeing habitat differences not sex differences (i.e. the weights within each sample are not
independent, but depend on habitat). This problem is referred to as the ‘confounding’ of two
factors because their effects cannot be separated.
3. Homogeneity of variances is the assumption that all samples have the same variation about
their means, so the analysis can pertain just to finding differences between means. Violation of
this assumption is likely to obscure true differences. It can often be met by transforming the data
(see section on statistical modelling). See the practical exercise on page 14 for the R command to
perform a Bartlett’s test of homogeneity of variances.
4. Normality is the assumption that the residuals are normally distributed about their sample
means. We have seen how Analysis of Variance only makes use of two parameters to describe
each sample: the mean and the average squared deviations (the variance). A normal distribution
is a symmetrical distribution of frequencies defined by just these two parameters, so if the scores
are normally distributed around their sample means, then the data will be adequately represented
in the Analysis of Variance test. But if the distribution of scores is skewed, or bounded within
fixed limits (e.g. body weights can extend upwards any amount but cannot fall below zero), then
the mean may not represent the true central tendency in the data, and the squared deviations may
be an unreliable indicator of variance. In such cases, it is often necessary to transform the data
first (see pp. 34-35). See the practical exercise on page 14 for the R command to perform a
Shapiro-Wilk normality test on the residuals.
When using any statistic (parametric or non-parametric), you should do visual diagnostic tests to
check its assumptions. This applies also to Analysis of Variance, and in R you can do it with a
command of the sort: plot(aov(y ~ x)).
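The checks for assumptions 3 and 4 take only a few lines in R. The variables y and x below are hypothetical stand-ins for a continuous response and a factor:

```r
# Hypothetical response y and factor x, standing in for real data
y <- c(28, 22, 24, 27, 24, 15, 19, 16, 18, 17)
x <- factor(rep(c("M", "F"), each = 5))

bartlett.test(y ~ x)                 # assumption 3: homogeneity of variances
shapiro.test(residuals(aov(y ~ x)))  # assumption 4: normality of residuals
plot(aov(y ~ x))                     # visual diagnostic plots
```

In both tests a small p-value is a warning: it indicates that the samples are unlikely to meet the assumption, and that a transformation may be needed.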
Summary of parameters for estimating the population mean
Whenever you collect a sample of measurements, you will want to summarise its defining characteristics. If the data are approximately normally distributed around some central tendency, and many types of biological data are, then three parametric statistics can provide much of the essential information. The sample mean, Ȳ, tells you the average measurement from your sample; the standard deviation (SD) tells you how much variation there is in the data around the sample mean; the standard error (SE) indicates the uncertainty associated with viewing the sample mean as an estimate of the mean of the whole population, μ.
1. Variable. A property that varies in a measurable way between subjects in a sample. Example: weight of seeds of the Princess Bean Phaseolus vulgaris (in: Samuels, M.L. 1991. Statistics for the Life Sciences. Macmillan).

2. Sample. A collection of individual observations selected by a specified procedure. In most cases the sample size is given by the number of subjects (i.e. each is measured once only). Example: a sample of 25 Princess Bean seeds, selected at random from the total production of an arable field, with weights (mg): 343, 755, 431, 480, 516, 469, 694, 659, 441, 562, 597, 502, 612, 549, 348, 469, 545, 728, 416, 536, 581, 433, 583, 570, 334.

3. Sample mean, Ȳ. The sum of all observations in the sample, divided by the size of the sample, n. The sample mean is an estimate of the population mean, μ (‘mu’), which is one of the two parameters defining the normal distribution (the other is σ, see below). Example: the sample mean Ȳ = (Σᵢ Yᵢ)/n = 526.1 mg. This comes from a population, the total production of the field, which follows a normal distribution and has a mean μ = 500 mg.

4. Sum of squares, SS. The squared distance between each data point (Yᵢ) and the sample mean, summed for all n data points. Example: the sample sum of squares SS = Σᵢ (Yᵢ - Ȳ)².

5. Variance, s². The variance in a normally distributed population is described by the average of n squared deviations from the mean. Variance usually refers to a sample, however, in which case it is calculated as the sum of squares divided by n - 1 rather than n. Example: the sample variance s² = SS/(n - 1).

6. Sample standard deviation, SD or s. Describes the dispersion of data about the mean. It is equal to the square root of the variance. For a large sample size, Ȳ approaches μ, and the standard deviation of the sample approaches the population standard deviation, σ (‘sigma’). It is then a property of the normal distribution that 95% of observations will lie within 1.960 standard deviations of the mean, and 99% within 2.576. Example: the sample standard deviation s = √variance = 113.7 mg. The standard deviation of the population from which the sample was drawn is σ = 120 mg.
7. Normal distribution. A bell-shaped frequency distribution of a continuous variable. The formula for the normal distribution contains two parameters: the mean, giving its location, and the standard deviation, giving the shape of the symmetrical ‘bell’. This distribution arises commonly in nature when myriad independent forces, themselves subject to variation, combine additively to produce a central tendency. Many parametric statistics are based on the normal distribution because of this, and also because of its property of describing both the location (mean) and dispersion (standard deviation) of the data. Since dispersion is measured in squared deviations from the mean, it can be partitioned between sources, permitting the testing of statistical models. Example: the weights of Princess Bean seeds in the population follow a normal distribution (shown in the graph, with frequency on the horizontal axis). Some 95% of the seeds lie within 1.96 standard deviations of the mean, i.e. within μ ± 1.96σ = 500 ± 235 mg.

8. Standard error of the mean, SE. Describes the uncertainty, due to sampling error, in the mean of the data. It is calculated by dividing the standard deviation by the square root of the sample size (SD/√n), and so it gets smaller as the sample size gets bigger. In other words, with a very large n, the sample mean approaches the population mean. If random samples of n measurements were taken from any population (not necessarily normal) with mean μ and standard deviation σ, the mean of the sampling distribution of Ȳ would equal the population mean μ. Moreover, the standard deviation of sample means around the population mean would be given by σ/√n. Example: the standard error of the mean SE = SD/√n = 22.74 mg.

9. Confidence interval for μ. Regardless of the underlying distribution of the data, the sample means from repeated random samples of size n would have a distribution that approached normal for large n, with 95% of sample means lying within 1.960 standard errors of μ. With only one sample mean Ȳ and standard error SE, these can nevertheless be taken as best estimates of the parametric mean and standard deviation of sample means. It is then possible to compute 95% confidence limits for μ at Ȳ ± 1.960·SE (for large sample sizes). For small sample sizes, the 95% confidence limits for μ are computed at Ȳ ± t0.05[n-1]·SE. Example: the 95% confidence limits for μ from the sample of 25 Princess Bean seeds are at Ȳ ± t0.05[24]·SE. The sample is thus representative of the population mean, which we happen to know is 500 mg. If we did not know this, the sample would nevertheless lead us to accept a null hypothesis placing the population mean anywhere between 479.05 and 573.15 mg.
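All of these summary statistics can be reproduced in R from the 25 seed weights listed under ‘Sample’ above:

```r
# The 25 Princess Bean seed weights (mg) from the summary above
weight <- c(343, 755, 431, 480, 516, 469, 694, 659, 441, 562,
            597, 502, 612, 549, 348, 469, 545, 728, 416, 536,
            581, 433, 583, 570, 334)
n <- length(weight)

mean(weight)                # sample mean: 526.1 mg
sd(weight)                  # sample standard deviation: 113.7 mg
SE <- sd(weight) / sqrt(n)  # standard error of the mean: 22.74 mg

# 95% confidence interval for the population mean, using the
# small-sample t-multiplier qt(0.975, n - 1), i.e. t0.05[24]
mean(weight) + c(-1, 1) * qt(0.975, n - 1) * SE  # roughly 479 to 573 mg
```

Note that qt(0.975, 24) gives the two-tailed 5% critical value of Student’s t with 24 d.f., the multiplier written t0.05[24] above.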
Practical: One-way Analysis of Variance
C. P. Doncaster 9
PRACTICAL : CALCULATING ONE-WAY ANALYSIS OF VARIANCE
Rationale
Analysis of variance is one of the most commonly used tests in biology, because biologists often
want to look for differences in mean responses between groups. Do male and female shrews
differ in body weight? Does crop yield differ with different concentrations of a fertiliser? Does
crop yield vary with rainfall? To find out whether shrews from a population of interest differ in
size between the sexes you could perform a t-test on samples from the population. This is a
simplified type of Analysis of Variance suitable for just two samples (males and females), and it
gives exactly the same statistical prediction. The Analysis of Variance comes into its own when
you are seeking differences between more than two samples. You would use Analysis of
Variance to find out if crop yield differs with three or more different concentrations of fertiliser.
You would also use the same method of Analysis of Variance to test the effect on crop yield of a
continuous variable such as rainfall, in which case you are testing whether rainfall has a linear
effect on yield (from a single sample rather than comparing between two or more samples).
In this practical you will perform an Analysis of Variance by hand, in order to see how it works.
This practical is designed to help you to interpret the output from statistical packages such as R,
which does most of the number crunching for you. Here is the scenario…
You have just graduated from University and found employment with the Mambotox consultancy.
Mambotox is funded by outside contracts to evaluate the environmental impact of agricultural
chemicals. Its speciality is testing the effects of pesticides on non-target insects, spiders and
mites that are the natural enemies of crop pests (and hence useful to farmers as biological control
agents). Your first job with this company is to perform an experiment to compare the effects on
hoverflies of three new brands of pesticide designed to target aphids. Aphids are a major pest of
crops, but hoverflies are useful because their larvae are voracious predators of aphids. So an
efficient pesticide that also kills hoverflies may be no better in practice than a less efficient one
that does not.
To do the test you randomly allocate the three pesticides to plots of wheat which have all been
seeded with the same number of hoverfly larvae. After applying the treatments, you sample the
plots for surviving hoverfly larvae. You want to know whether the pesticide treatments influence
the survival of hoverfly larvae. This problem calls for an Analysis of Variance.
The hypothesis
Take a look at your data set at the top of page 13. It shows that each of the three treatments (Zap,
GoFly and Noxious) was applied to five replicate plots; the scores are the number of hoverfly
larvae counted in each replicate after treatment. The null hypothesis, H0, is that the mean scores
do not differ between treatments, i.e. that mean(Zap) = mean(GoFly) = mean(Noxious) in the
sampled population. The alternative hypothesis is that the population means are not all equal.
Analysis of Variance will allow you to test H0 and to decide whether it should be rejected in
favour of the alternative hypothesis.
Start to fill out the cells of the table beneath the data, by summing the scores for each
treatment and dividing each sum by its sample size to obtain the group means. That is what is
meant by the expression:
Ȳⱼ = (Σᵢ Yᵢⱼ)/nⱼ, summing the scores Yᵢⱼ over i = 1 to nⱼ

i.e. Group mean = sum of scores in group / number of scores in group
You can read the formula as follows: the mean (denoted Ȳⱼ) for each treatment j is equal to the
sum (‘Σ’) of the scores for that treatment (‘Yᵢⱼ’) for i = 1 to nⱼ, divided by nⱼ, which is the sample
size (and for each of these treatments it equals 5 plots).
One of the means is rather larger than the others. How do we know if the differences between the
means are due to the pesticide treatments or because of random variation? It might be that
random differences between the 15 plots are enough to explain the higher mean value under one
treatment. This is precisely the null hypothesis that is tested by Analysis of Variance.
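The group means take one line of R with tapply(). The counts below are invented placeholders for the real data sheet on p. 13 (whose scores sum to 9260):

```r
# Invented hoverfly counts standing in for the p. 13 data sheet
count     <- c(620, 580, 650, 600, 610,   # Zap (placeholder values)
               540, 560, 520, 570, 530,   # GoFly (placeholder values)
               720, 700, 680, 710, 690)   # Noxious (placeholder values)
treatment <- factor(rep(c("Zap", "GoFly", "Noxious"), each = 5))

tapply(count, treatment, mean)  # group mean for each treatment j
mean(count)                     # grand mean of all 15 plots
```

tapply() splits the vector of counts by the levels of the factor and applies mean() to each group, which is exactly the hand calculation described above.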
Analysing variance from the sums of squares
Analysis of Variance finds out what causes the individual scores to vary from the grand mean of
all the n = 15 plots. If you calculate this grand mean you should get a value of 9260/15 = 617.33.
None of the scores actually equals this grand mean, and their deviations from it are explained by
two possible sources of variation. The first source of variation is the pesticide treatment (Zap,
GoFly or Noxious). If Zap kills fewer hoverfly larvae, then we would expect plots treated with
Zap to have higher scores in general than plots treated with the other pesticides. The second
source of variation is due to differences among plots, which can be seen within each treatment.
The way we measure total variation for an Analysis of Variance is by summing up all the
squared differences from the grand mean. This is called the ‘total sum of squares’ or ‘SStotal’:

SStotal = Σj Σi (Yij − Ȳtotal)²   (summing over i = 1 to nj within each treatment, and over j = 1 to a treatments)

The above expression means: SStotal is obtained by subtracting the grand mean
(denoted Ȳtotal) from each score (‘Yij’ denoting the ith score
in the jth treatment) and squaring this difference, then summing these squares for all scores in
each treatment and all a treatments. Do this, and keep a note of the value you get.
The reason for squaring each difference is that we can then separate this total variation into its
two sources: one due to differences between treatments (called the ‘sum of squares between
groups’, or ‘SSgroup’), and one due to the normal variation between plots (the ‘error sum of
squares’, or ‘SSerror’). Then it is a very useful property of squared differences that:

SStotal = SSgroup + SSerror.
Note that the word ‘error’ here does not mean ‘mistake’, but is a term describing the variation in
scores that we cannot attribute to a specific variable; you may also see it referred to as ‘residual’.
Calculate these sums of squares and put the values in the right-hand column of the table
below. Do this by first calculating the between-group sums of squares for each treatment in turn:

SSgroup(j) = Σi (Ȳj − Ȳtotal)² = nj (Ȳj − Ȳtotal)²

In other words, for each treatment j, square the difference between the group mean and the grand
mean and multiply by the sample size. Then add the three results together to get the overall
variation between group means, SSgroup, and put this value in the right-hand column. Now
calculate the error sums of squares for each treatment in turn:
SSerror(j) = Σi (Yij − Ȳj)²

In other words, square the difference between each score and its group mean, and sum these
squares. Then add the three group sums to get the overall variation within groups, SSerror, and put
this in the right-hand column. Finally, add SSgroup to SSerror to get SStotal, and put it in the right-hand
column. Does this total equal the value that you obtained from the sum of all squared deviations
from the grand mean? It should, showing how total variance can be partitioned into its sources.
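If you want to verify your arithmetic, the same partition can be computed in R. This is a sketch, again assuming the page-13 scores are typed in as vectors:

```r
zap     <- c(700, 850, 820, 640, 920)
gofly   <- c(480, 460, 500, 570, 580)
noxious <- c(500, 550, 480, 600, 610)
scores  <- c(zap, gofly, noxious)
grand   <- mean(scores)                       # the grand mean, 9260/15

# Total SS: squared deviations of every score from the grand mean
SStotal <- sum((scores - grand)^2)

# Between-group SS: nj * (group mean - grand mean)^2, summed over groups
SSgroup <- sum(sapply(list(zap, gofly, noxious),
                      function(g) length(g) * (mean(g) - grand)^2))

# Error SS: squared deviations of each score from its own group mean
SSerror <- sum(sapply(list(zap, gofly, noxious),
                      function(g) sum((g - mean(g))^2)))

c(SSgroup = SSgroup, SSerror = SSerror, SStotal = SStotal)
# SSgroup and SSerror sum exactly to SStotal
```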
The F-value
It is intuitively reasonable to think that if we get a large variation between the group means
compared to variation within the groups, then the means could be considered to differ between
groups because of real differences between the pesticides (rather than because of residual
variation). This is the comparison that the F-value makes for us. It takes the average sum of
squares due to group differences (called the ‘group mean square’, or MSgroup) and divides it by the
average sum of squares due to plot differences (the ‘error mean square’, or MSerror):

F = MSgroup / MSerror = [SSgroup / (a − 1)] / [SSerror / (n − a)]

where a = number of groups, and n = total of 15 plots.
Calculate these mean squares, and add them into the right-hand column. Finally, calculate F.
This ratio will be large if the variation between the groups is large compared to the variation
within the groups. But the value of F will be close to unity for a true null hypothesis, of no
variation due to groups. Just how far above F = 1.00 is too much to be attributable to chance is a
rather complicated function of the number of groups and the number of plots in each group.
Tables of the F statistic will give us this probability based on the degrees of freedom for the
between group variation (a - 1 for a groups or treatments) and the degrees of freedom for the
within group variation (n - a ), or it will be provided automatically by statistical packages.
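The mean squares, F-ratio, p-value and critical value can also be checked in R. This is a sketch using the rounded sums of squares from the worked example (SSgroup = 215613, SSerror = 77080, as in the R output on page 14), with a = 3 groups and n = 15 plots:

```r
a <- 3; n <- 15                      # number of groups and total plots
SSgroup <- 215613; SSerror <- 77080  # sums of squares from the worked example

MSgroup <- SSgroup / (a - 1)         # group mean square
MSerror <- SSerror / (n - a)         # error mean square
Fval <- MSgroup / MSerror            # F-ratio, ~16.78

pf(Fval, df1 = a - 1, df2 = n - a, lower.tail = FALSE)  # p-value, ~0.00033
qf(0.95, df1 = a - 1, df2 = n - a)   # critical F at the 5% level, ~3.89
```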
Use the published table provided for you in Appendix 7 to find the critical value for the upper
5% point of the F-distribution with the appropriate degrees of freedom (denoted v1 and v2 in the
table). The columns of the table give a range of possible degrees of freedom for the group mean
square, which is equal to a -1. The rows of the table give a range of possible degrees of freedom
for the error mean square, which is equal to n - a. Is your calculated value of F greater than this
critical value? If so, you can reject the null hypothesis with < 5% chance of making a mistake in
so doing. In the report of your analysis you would say “pesticide treatments do differ in their
effects on hoverfly numbers: Fv1,v2 = #.##, p < 0.05”, substituting in the values of v1 and v2 and the
calculated F to 2 decimal places. Put this conclusion in the final row of your analysis.
Using a statistical package
Let’s compare the calculations you have been doing laboriously by hand with the output from
a statistical package. Read the same dataset into R, using the format shown on page 14. Now run
an Analysis of Variance in RStudio with the suite of commands on page 14. You should get the
same result as you got from the calculation by hand. Make sure you understand this output in
terms of the calculations you have been doing. When you use statistical packages such as R, you
will need to comprehend what the output is telling you, so that you can be sure it has done what
you wanted. For example, it is always a good idea to check that the output shows the correct
numbers of degrees of freedom. If it is not showing the degrees of freedom that you think it
should, then the package has probably tried to analyse your data in a different way from that
intended, so you would need to go back and check your input commands.
Having done the analysis in RStudio, you can now plot means and their confidence intervals
with two additional lines of R code, which call a script of plotting instructions and then run it:
source(file="http://www.southampton.ac.uk/~cpd/anovas/datasets/PlotMeans.R")
plot_means(aovdata$Trtmnt, aovdata$Score, "Treatment", "Score", "CI")
The 95% confidence intervals around the jth mean are at Ȳj ± 1.96·sj/√nj, where sj is the
sample standard deviation:

sj = √[ Σi (Yij − Ȳj)² / (nj − 1) ]

The reason for this is that 95% of a normal distribution lies within 1.96 standard deviations of its
mean, and the sample mean is itself normally distributed with a standard deviation (its standard
error) given by the term sj/√nj. Which of the pesticides
can you recommend to farmers? The correct answer is none yet, until you have checked the
assumptions of the analysis.
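To check the intervals plotted by plot_means, you can compute one by hand in R. This is a sketch for one treatment, assuming the Zap scores from page 13:

```r
zap <- c(700, 850, 820, 640, 920)
nj  <- length(zap)
sj  <- sd(zap)                        # sample standard deviation (n - 1 denominator)
se  <- sj / sqrt(nj)                  # standard error of the mean

c(lower = mean(zap) - 1.96 * se,
  upper = mean(zap) + 1.96 * se)      # approximate 95% CI around the Zap mean
```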
Underlying assumptions of Analysis of Variance
Any conclusions that you draw from this analysis are based on four assumptions. What are
they? Refer back to page 6 if necessary.
1. The first assumption is that the plots are assigned treatments at random, which was indeed a
design consideration when you carried out the experiment.
2. The second assumption is that the residuals should be independently distributed, so they
succeed each other in a random sequence and knowing the value of one does not allow you to
predict the value of another (i.e. they truly represent unexplained variation). This is the
‘assumption of independence,’ which is a matter of declaring all known sources of variation. In
this case, any variation not due to treatment contributes to the MSerror, and we assume it
contains no systematic variation (e.g., due to using different fields for different treatments).
The other assumptions concern the distribution of the error terms (residuals), ε. Use R to
test for these by using the commands on page 14.
3. The residuals should be identically distributed for each treatment, so all the groups have
similar variances. This is because the error mean square used to calculate F is obtained from
the pooled errors around each group mean. Since the analysis is only seeking differences
between means, it assumes all else is equal. This is the ‘assumption of homogeneity of
variances,’ which is visualised with the graph of residuals versus fitted values (funnel shaped
if heterogeneous), and also by the slope of a scale-location graph (non-zero if heterogeneous).
4. Finally, the residuals should be normally distributed about the group means, because the sums of
squares that we use to calculate variance will only provide a true estimate of variance if these
residuals are normally distributed. This is the ‘assumption of normality,’ which is visualised
by the normal Q-Q plot. The plot should follow an approximately straight diagonal; bowing
indicates skew (to the right if convex) and an S-shape indicates a flatter-than-normal distribution.
There are various statistical methods of putting probability limits on the likelihood of your
residuals meeting each of these assumptions. We will not go into them here, but they are
described in any text book of statistics. Having visually checked the assumptions, which of the
pesticides can you recommend to farmers?
The data:
PESTICIDE
Zap GoFly Noxious
700 480 500
850 460 550
820 500 480
640 570 600
920 580 610
The Analysis of Variance:

                                        Treatment group j
                                 Zap       GoFly     Noxious      Total
Sample sizes, nj:
Sums of scores, Σi Yij:
Means, Ȳj = Σi Yij / nj:                                          Ȳtotal =
SSgroup = Σj nj (Ȳj − Ȳtotal)²:      ___ + ___ + ___ =            d.f. =
SSerror = Σj Σi (Yij − Ȳj)²:         ___ + ___ + ___ =            d.f. =
SStotal = SSgroup + SSerror =
MSgroup = SSgroup / (a − 1) =
MSerror = SSerror / (n − a) =
F = MSgroup / MSerror =
Fcrit[0.05] =
Conclusion:
Analysis of Variance in R
For this part, refer to the ‘Using RStudio – Help Guide’ on Blackboard. Type the
data into a new text file called ‘Score-by-pesticide.txt’, separating each score
from its treatment level by a tab. Then read this file into a ‘data frame’ in R and
perform the analysis in RStudio with the following suite of commands:
# 1. Prepare the data frame 'aovdata'
aovdata <- read.table("Score-by-pesticide.txt", header = T)
attach(aovdata) # Access the data frame
Trtmnt <- factor(Trtmnt) # Set Trtmnt as a factor
# 2. Command for factorial analysis
summary(aov(Score ~ Trtmnt)) # Run the ANOVA
bartlett.test(Score ~ Trtmnt) # Test for homogeneous variances
shapiro.test(resid(aov(Score ~ Trtmnt))) # Test for normality
# 3. Plot data and residuals
par(cex = 1.3, las = 1) # Enlarge, orient plot labels
plot(Trtmnt, Score, xlab="Pesticide", ylab="Score") # Box plot
par(mfrow = c(2, 2)) ; plot(aov(Score ~ Trtmnt)) # 4 residual plots
par(mfrow = c(1, 1)) ; detach(aovdata) # Reset plot window; detach data frame
The ‘summary’ and ‘plot’ commands will give the following outputs:
Df Sum Sq Mean Sq F value Pr(>F)
Trtmnt 2 215613 107807 16.78 0.000334 ***
Residuals 12 77080 6423
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
From the ANOVA table, you conclude that the
treatment types differ in their effects on survival of
hoverfly larvae (F2,12 = 16.78, P < 0.001). The
ANOVA tells you nothing more than this. You then
interpret where the difference lies from the box plot
(showing median, first and third quartiles, and
max/min values up to ~2 s.d.; any outliers would be
plotted individually). The first two of four residuals
plots are shown below. Residuals versus fitted
(mean) response visualizes any heterogeneity of
variances. Residuals versus theoretical (normal)
quantiles visualises any systematic deviation from
normal expectation given by the diagonal line.
These plots show no detectable increase in heterogeneity with the mean (Bartlett’s K²₂ = 2.63, P
= 0.27), and no systematic deviation from normality (Shapiro-Wilk W = 0.96, P = 0.75).
Lecture notes: Two-way Analysis of Variance
C. P. Doncaster 15
LECTURE: TWO-WAY ANALYSIS OF VARIANCE
We have used one-way Analysis of Variance to test whether different treatments of a single
factor have an effect on a response variable (finding a treatment effect: F2,12 = 16.78, P < 0.001).
With two-way Analysis of Variance, we divide the samples in each treatment into sub-samples
each representing a different level of a second factor. A hypothetical example illustrates what the
analysis can reveal about the response variable.
Example of two-way Analysis of Variance: factorial design
In the following experiment, we wish to test the efficacy of different systems of speed reading,
and to know whether males and females respond differently to these systems. We randomly
assign 30 subjects (S1…S30) to three treatment groups: T1, T2 and T3, with 10 subjects per
treatment of which 5 are male and 5 female. The three groups are each tutored in a different
system of speed reading. A reading test is then given and the number of words per minute is
recorded for each subject. The data are presented in a design matrix like this:
Table 1. Design matrix for factorial Analysis of Variance.
SYSTEM
T1 T2 T3
SEX
Male Y1, ... Y5 Y11, ... Y15 Y21, ... Y25
Female Y6, ... Y10 Y16, ... Y20 Y26, ... Y30
The table thus has 6 data cells, each containing the responses of 5 independent subjects (here
coded Y1, ... Y5 etc). This is a ‘factorial design’ because these six cells represent all treatment
combinations of the two factors SEX and SYSTEM. Because each cell contains the same number
of responses, we call this a ‘balanced design,’ and because each level of one factor is measured
against each level of the other, it is also an ‘orthogonal’ design. [See page 31 for cross-factored
Analysis of Variance on unbalanced data.].
A two-way Analysis of Variance will give us three very useful pieces of information about the
effects of the two factors:
1. Whether mean reading speeds differ between the three techniques when responses of males
and females are pooled, indicated by a significant F for the SYSTEM main effect;
2. Whether males and females have different reading speeds when responses for the three
systems are pooled, indicated by a significant F for the SEX main effect;
3. Whether males and females respond differently to the techniques, indicated by a significant F
for the SEX:SYSTEM interaction effect.
We get these three values of F from five sources of variation: the n scores themselves, the a cell
means Ȳ, the r row means R̄, the c column means C̄, and the single global mean Ḡ.
Table 2. Component means for the factorial design.

                        SYSTEM                  Row
                 T1       T2       T3          means
Male             Ȳ11      Ȳ12      Ȳ13          R̄1
Female           Ȳ21      Ȳ22      Ȳ23          R̄2
Column means     C̄1       C̄2       C̄3           Ḡ
The R analysis of real data is shown below, producing the interaction plot above. The output
contains the three values of the F-statistic and their significance. The rest of this section is
devoted to explaining just how the means in the table above can lead us to the inferences in the
analysis below – that sex and system both have additive effects on reading speed, with no
interaction between them.
# Prepare data frame ‘aovdata’
aovdata<-read.table("System-by-sex.csv",sep=",",header=T)
attach(aovdata)
# Classify factors and covariates:
sex <- as.factor(sex) ; system <- as.factor(system)
# Specify the model structure:
summary(aov(speed ~ sex*system))
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 25404 25404 5.716 0.025 *
system 2 503215 251608 56.616 8.19e-10 ***
sex:system 2 2817 1408 0.317 0.731
Residuals 24 106659 4444
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
# Interaction plot:
interaction.plot(
sex, system, speed,
xlab = "Sex", ylab = "Speed", trace.label = "System",
las = 1, xtick = TRUE, cex.lab = 1.3
)
# Test for homogeneity of variances
bartlett.test(speed ~ interaction(sex, system))
Bartlett test of homogeneity of variances
data: speed by interaction(sex, system)
Bartlett's K-squared = 9.8486, df = 5, p-value = 0.07964
# Test for normality of residuals
shapiro.test(resid(aov(speed ~ sex*system)))
Shapiro-Wilk normality test
data: resid(aov(speed ~ sex * system))
W = 0.97261, p-value = 0.6127
detach(aovdata)
Using a statistical model to define the test hypothesis
In defining the remit of our analysis, we want to make a statement about the hypothesised
relationship of the effects to the response variable, and this can be done most concisely by
specifying a model. In the one-way Analysis of Variance that you conducted in the practical, you
tested the model:
HOVERFLIES = PESTICIDE + ε
The ‘=’ does not signify a literal equality, but a statistical dependency. So the statistical analysis
tested the hypothesis that variation in the response variable on the left of the equals sign
(numbers of hoverflies) is explained or predicted by the factor on the right (pesticide treatments),
in addition to a component of random variation (the error term ε, ‘epsilon’). This error term
describes the residual variation between the plots within each treatment. We could have written it
out in full as ‘PLOTS(PESTICIDE)’ meaning the variation between the random plots nested
within the different types of pesticide (‘nested’ because each treatment has its own set of plots).
The Analysis of Variance tested whether much more of the variation in hoverfly numbers falls
between the categories of ‘Zap’, ‘GoFly’ and ‘Noxious’, and so is explained by the independent
variable PESTICIDE, than lies within each category as unexplained residual variation, ε =
PLOTS(PESTICIDE). This was accomplished by calculating the ratio:

Pesticide effect: F = MSgroup / MSerror = MS[PESTICIDE] / MS[PLOTS(PESTICIDE)]
For our two-way experimental design, we can also partition the sources of variance. This time the
sources partition into two main effects plus an interaction, and the residual variation within each
sex and system combination. The full model statement looks like this:
SPEED = SEX + SYSTEM + SEX:SYSTEM + SUBJECTS(SEX:SYSTEM)
The four terms on the right of the equals sign describe all the sources of variance in the response
term on the left. The last term describes the error variation, ε. It is often not included in a model
description because it represents residual variation unexplained by the main effects and their
interaction. But it is always present in the model structure, as the source of random variation
against which to calibrate the variation explained by the main effects and interaction. With this
model, we can calculate three different F-ratios:
Sex effect: F1 = MSgroup / MSerror = MS[SEX] / MS[SUBJECTS(SEX:SYSTEM)]

System effect: F2 = MSgroup / MSerror = MS[SYSTEM] / MS[SUBJECTS(SEX:SYSTEM)]

Sex:System interaction effect: F3 = MSinteraction / MSerror = MS[SEX:SYSTEM] / MS[SUBJECTS(SEX:SYSTEM)]
Degrees of freedom
Before attempting the analysis, we should check how many degrees of freedom there are for each
of the main effects and the interaction, and how many error degrees of freedom. Remember that
degrees of freedom are given by the number of pieces of information that we have on a response,
minus the number needed to calculate its variation.
The SEX main effect is tested with 1 degree of freedom (one less than its two levels: male and
female), and the SYSTEM main effect with 2 degrees of freedom (one less than its three levels);
the SEX:SYSTEM interaction effect is tested with the product of these two sets of degrees of
freedom (i.e. 1 × 2 = 2 degrees of freedom). The error degrees of freedom for both effects and the
interaction comprise one less than the remaining numbers in the total sample of N = 30, which is
30-(1+2+2)-1 = 24. You can also think of error degrees of freedom as being N – a, which is the
number of observations minus the a = 6 sample means needed to calculate their variation.
Thus the significance of the SEX effect is tested with a critical F1,24, SYSTEM with F2,24 and the
SEX:SYSTEM interaction with F2,24.
General rule: In general for an Analysis of Variance on n subjects (Y) measured against two
independent factors X1 (the row factor in a design matrix such as Table 1) and X2 (the column
factor), with r and c levels (samples) respectively, the model has the following degrees of
freedom:
model: Y = X1 + X2 + X1:X2 + Y(X1:X2)
d.f.: r-1 c-1 (r-1)(c-1) N-rc
The reason why the error degrees of freedom are rc less than N is simply because rc is equal to
one more than the sum of all the main effect and interaction degrees of freedom. Thus the four
sets of degrees of freedom all add up to a total of N - 1 degrees of freedom.
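This bookkeeping can be checked in R for the reading-speed example (r = 2 sexes, c = 3 systems, N = 30 subjects):

```r
r <- 2; c <- 3; N <- 30
# d.f. for each term of the model Y = X1 + X2 + X1:X2 + Y(X1:X2)
dfs <- c(X1 = r - 1,
         X2 = c - 1,
         interaction = (r - 1) * (c - 1),
         error = N - r * c)
dfs                    # 1, 2, 2, 24
sum(dfs) == N - 1      # the four sets of d.f. add up to N - 1
```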
In practice, when you design an experiment or fieldwork protocol that will require Analysis of
Variance, you can use this knowledge to work out in advance how many subjects you need. You
will need rc degrees of freedom (e.g. 2 levels of sex times 3 of system = 6) just to define the
group dimensions, and then at least the same again to give you enough error degrees of freedom
for a reasonably powerful test.
How to do a two-way Analysis of Variance
A two-way analysis comprises a test of the model as a whole, and a test of the individual terms in
the model. Its degrees of freedom and sums of squares follow the same principles as the one-way
Analysis of Variance. The ‘Quantities’ column shows how the component sums of squares relate
to each other (with n defining the number of replicates in each of the rc samples):
Table 3a. Calculation of degrees of freedom and sums of squares for the two-factor model.

Source of variation          d.f.            SS                                 Quantities
1 Among cells (model)        rc − 1          n·Σ(Ȳ − Ḡ)² over the rc cells      [1] = [3] − [2]
2 Within cells (error)       rc(n − 1)       Σ(Y − Ȳ)² over all scores
3 Total                      rcn − 1         Σ(Y − Ḡ)² over all scores          [3] = [1] + [2]

Table 3b. Calculation of degrees of freedom and sums of squares for the terms in the model.

Source of variation          d.f.            SS                                 Quantities
4 Between rows (Sex)         r − 1           cn·Σ(R̄ − Ḡ)² over the r rows
5 Between columns (System)   c − 1           rn·Σ(C̄ − Ḡ)² over the c columns
6 Interaction (Sex:System)   (r − 1)(c − 1)  by subtraction                     [6] = [1] − [4] − [5]
7 Within cells (error)       rc(n − 1)       Σ(Y − Ȳ)² over all scores          [7] = [2]
8 Total                      rcn − 1         Σ(Y − Ḡ)² over all scores          [8] = [3]
These sums of squares allow us to calculate mean squares, MS, for components 1 to 2 and 4 to 7,
by dividing each SS by its degrees of freedom. Finally, we get one F-statistic for each of
components 4, 5 and 6, by dividing the row MS by the MSerror (from component 7). These are the
mean squares and F-statistics shown in the R output pictured earlier.
You do not need to learn the formulae in the table above, but you should be able to gain from
them an appreciation of how the total sums of squares are partitioned into the different sources.
Interpreting the results
When we did one-way Analysis of Variance we obtained a single F-statistic on which to base our
conclusions about the hypothesised relationship. The two-way analysis, however, gives three
different values of F, each telling us about different aspects of the hypothesised relationship.
A significant SEX:SYSTEM interaction would allow us to conclude that the techniques have
different effects on males and females. In the particular example we have in Fig. 1, the
interaction term is not significant (F2,24 = 0.32, p > 0.7), meaning that the effect of reading
technique on speed is not modulated by (does not depend on) sex. In other words, reading
technique influences speed in the same way for males and females. That would be the conclusion
from the R analysis shown above.
A significant SEX effect (F1,24 = 5.72, p = 0.025 in Fig. 1) means that males and females have
different mean speeds, irrespective of technique.
A significant SYSTEM effect (F2,24 = 56.62, p < 0.001) means that reading technique does
influence mean speeds, irrespective of sex.
How do we interpret the analysis if one or other of the main effects is not significant? If the
interaction effect is significant, but the SYSTEM effect is not, what does this tell us about the
different reading techniques? In general, if an interaction term is significant, then both of the
component effects play a real part in explaining the response, whether or not each is significant as
a main effect, because each one influences the effect of the other on the response variable. We
should therefore always report a significant interaction first, before considering the main effects.
Some graphical illustrations will help to explain why this is.
Using interaction plots to help interpret two-way Analysis of Variance
Take a look at the set of eight graphs on the next page. These are called ‘interaction plots’ and
they illustrate all eight possible ways in which a response variable can depend on two factors.
The idea is to plot the response variable against one of the independent effects (it does not matter
which one) and then plot on the graph the sample means for each level of the other independent
effect. For the sake of clarity, means are plotted without error bars, and we can assume that each
would have only a small residual variation above and below it.
For each type of SYSTEM (T1, T2 and T3), the mean response is plotted for each type of SEX
(male or female), and joined by a line. Thus the mid-point of each of these lines reveals the mean
reading speed for systems T1, T2 and T3, irrespective of any sex effects. You can guess roughly
where the mean reading speed is for each sex from the average height of the three points at each
sex.
Fig. 2. Interaction plots for two independent effects, illustrating the eight possible outcomes of a two-way Analysis of Variance. Each panel plots mean SPEED against SEX (M, F), with one line joining the two means for each of the systems T1, T2 and T3. The eight panels show:

1. Significant SEX effect; no significant SYSTEM effect; no significant interaction.
2. No significant SEX effect; significant SYSTEM effect; no significant interaction.
3. Significant SEX effect; significant SYSTEM effect; no significant interaction.
4. Significant SEX effect; significant SYSTEM effect; significant interaction.
5. No significant SEX effect; significant SYSTEM effect; significant interaction.
6. Significant SEX effect; no significant SYSTEM effect; significant interaction.
7. No significant SEX effect; no significant SYSTEM effect; significant interaction.
8. No significant SEX effect; no significant SYSTEM effect; no significant interaction.
Graph 1 in Fig. 2 shows three systems that do not differ in their effects on reading speeds, but
females out-perform males on average.
Graph 2 shows males and females doing equally well, but subjects learning system T1
outperforming those learning system T2 who do better than those learning system T3.
Graph 3 shows the same differences between systems, but females also doing better on
average than males under any of the systems. This is the result we actually obtained.
Graph 4 shows what a significant interaction effect looks like. The effects of system depend
on sex, with differences between the methods having a more pronounced effect on female
reading speeds than those of males. In other words, the system effect is modulated by sex (or
equally, the sex effect is modulated by system).
Graph 5 shows males and females with the same average reading speeds (as in graph 2), but
the system effect depends very much on sex, with T3 being best for males and T1 for females.
In graph 6, females do better than males on average. The mid-points of the lines all coincide at
the same score for the response variable, and so no differences are apparent between the
systems if we pool males and females. But the type of reading system clearly does have an
important influence on males, and an equally important - but different - influence on females.
Thus the significant interaction indicates a real effect of system, even though it was not
significant as a main effect.
In graph 7, neither sex nor system is significant as a main effect, but their combined effect is.
The effects of technique are apparent only when the sexes are considered separately.
In graph 8, speed is not influenced by sex or system, either independently or interactively.
Only under this outcome would the null hypothesis be accepted, that neither factor has an
influence on reading speed.
Other types of two-way Analysis of Variance
So far we have only considered factorial designs, which have replicates in all combinations of
levels of both factors. If a two-factored design has no replication within each cell, then it will
not be possible to look for interaction effects, and they must be assumed to be negligible. The
‘Latin square’ is an example of this (read more about it in Sokal & Rohlf). It is used in
situations where a single main effect is being tested (say 4 types of fertiliser on crop yield),
but in the presence of a second ‘nuisance’ effect (e.g. a gradient of moisture on the slope of a
hill). The best way to deal with this situation is to lay out the plots in a structured pattern
(rather than random allocation):
Hill top A B C D
B C D A
C D A B
Hill bottom D A B C
Thus each of the 4 levels of height receives each of the 4 types of fertiliser (A-D), so it is a fully
orthogonal design. The test model is: ‘Response = Factor + Block’, meaning that the response
(yield) is to be tested against a main factor (fertiliser) and a blocking variable (moisture), with
an error mean square being provided by the unexplained interaction Factor:Block.
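A sketch of this blocked analysis in R, using invented yields and hypothetical variable names (the layout follows the 4 × 4 pattern above):

```r
# Rows of the layout are heights on the hill; fertilisers A-D rotate across rows
height     <- factor(rep(1:4, each = 4))
fertiliser <- factor(c("A","B","C","D",
                       "B","C","D","A",
                       "C","D","A","B",
                       "D","A","B","C"))
set.seed(1)
yield <- rnorm(16, mean = 10)       # fabricated data, for illustration only

# Response = Factor + Block; the Factor:Block interaction is left as error
fit <- aov(yield ~ fertiliser + height)
summary(fit)
# d.f.: fertiliser = 3, height = 3, residuals = (4-1)*(4-1) = 9
```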
Many other designs are possible. You might read about nested analyses, or three-way or
higher order factorials, but when designing your own data collection, try to avoid the need for
these, because greater sophistication always requires more stringent conditions.
Lecture notes: Regression
C. P. Doncaster 23
LECTURE: REGRESSION
We have seen how Analysis of Variance gives us the capacity to test for differences between
category means. For example, are males heavier on average than females in the sampled
population? Here the response variable is weight and the categories are the two sexes. Sometimes
however we want to measure the response variable against a continuous, instead of a categorical,
variable. If we want to know whether Weight varies with Age, we could divide the observations
into age categories (e.g. ‘juvenile’ and ‘adult’) and do an ANOVA, or we could measure Weight
on a continuous scale with Age. In the latter case we are asking whether Weight regresses with
Age. Specifically, we hypothesise that Weight shows a linear relationship to Age (we will treat
non-linear relationships later). The statistical model is the same in both cases, and it is tested
with Analysis of Variance in both cases. Only the degrees of freedom are different:
Model for Analysis of Variance by categories: Weight = Age +
d.f. for n data points and a categories: a-1 n-a
Model for Analysis of Variance by regression: Weight = Age +
d.f. for n data points: 1 n-2
Both models could be analysed with the ‘aov’ command in R, though the first one would require
identifying Age as a ‘factor’ (with the command: Age <- as.factor(Age)). Whether you do
the regression analysis with the ‘aov’ command or the ‘lm’ command in R, the same Analysis of
Variance will be done for you, giving an F-statistic with 1 and n-2 degrees of freedom.
Where do these regression degrees of freedom come from? The value of F is calculated from
MS[Age] divided by MS[]. For MS[Age] we have 1 d.f. because we have two pieces of
information with which to construct our regression line - the intercept and slope - and we need
one piece of information - the overall mean weight - in order to calculate whether the regression
varies from horizontal. For MS[] we have n-2 degrees of freedom because we have n pieces of
information - the data points - and we need two pieces - the intercept and slope - in order to
calculate the residual variation, given by the squared deviation of each observation from the line.
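These degrees of freedom appear in any R regression output. A sketch with simulated data for n = 12 points (the names and values are invented, not the badger data):

```r
set.seed(2)
Age    <- 1:12                                 # 12 hypothetical ages in days
Weight <- 400 - 15 * Age + rnorm(12, sd = 80)  # invented weights around a line

fit <- lm(Weight ~ Age)
anova(fit)   # Age: 1 d.f.; Residuals: n - 2 = 10 d.f.
```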
Let’s see how this works with an actual example. The following page shows a data set on new-
born badger cubs. Body weights in grams at different ages in days have been typed into a text file
and the response Weight regressed against the predictor Age. The ‘lm’ command in R has done
an Analysis of Variance on the 12 data points, giving 1 and 10 d.f. This Analysis of Variance
tests the compatibility of the data with a regression slope of zero (i.e., a horizontal regression) in
the population of interest. The result of F1,10 = 3.90, P = 0.076 tells us that we have too high a
probability of a false positive (P > 0.05) to reject the null hypothesis of zero slope, and therefore
that weight does not co-vary detectably with age. The plot shows data points with homogeneous
variance across the range of Age, no obvious deviations from normally distributed residuals
around the regression line, and a linear relationship. The 95% confidence intervals in the plot
show that the regression slope could swivel to horizontal without passing outside them –
confirming our lack of confidence in the sampled population having a relationship of Weight to
Age.
How does the analysis arrive at this result? Look now at page 25, which shows an Excel file into
which the data have been typed. Here we see how the F-value was calculated.
As with the Analysis of Variance for a class predictor variable, the Analysis of Variance for a
continuous predictor variable partitions the squared deviations of the response variable into two
independent parts. These are the explained (or ‘regression’), and the unexplained (or ‘residual
error’), sums of squares, which together add up to the total squared deviations of the response
variable from its mean value. The Table on page 26 summarises the operations.
Lecture notes: Regression
C. P. Doncaster 24
# Linear regression in R on response of Weight to Age
# 1. Prepare the data frame ‘aovdata’
aovdata <- read.table("Weight-by-age.txt", header = T)
attach(aovdata) # Access the data frame
Age <- as.numeric(Age) # Set Age as ‘numeric’
# 2. Commands for regression analysis
model.1.1i <- lm(Weight ~ Age) # Analyse and store
summary(model.1.1i) # Print the results
Call:
lm(formula = Weight ~ Age)
Residuals:
Min 1Q Median 3Q Max
-144.058 -89.751 7.117 68.571 174.375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 420.58 67.85 6.198 0.000102 ***
Age -18.22 9.22 -1.976 0.076392 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
Residual standard error: 110.2 on 10 degrees of freedom
Multiple R-squared: 0.2808, Adjusted R-squared: 0.2089
F-statistic: 3.904 on 1 and 10 DF, p-value: 0.07639
# 3. Plot the data
plot(Age, Weight,cex=1.5, las=1,
xlab="Age (days)", ylab="Weight (g)")
# 4. Add regression line and 95% confidence intervals
abline(coef(model.1.1i)) # add regression line
confint <- predict(model.1.1i, interval="confidence")
lines(Age, confint[,2], lty=2) # add lower c.i
lines(Age, confint[,3], lty=2) # add upper c.i
coef(model.1.1i) # print intercept and slope
(Intercept) Age
420.57576 -18.21678
# 5. Test assumptions
shapiro.test(resid(lm(Weight ~ Age))) # Normality of residuals
library(car); ncvTest(lm(Weight ~ Age)) # Homogeneity of variance
This is how the terms are calculated in the Excel sheet on the preceding page:

1. SSx = Σ(x − x̄)²
   The sum of squared deviations of x from its mean, where x is Age (column B) and x̄ is mean age (cell B18).
2. SS(Total) [or ‘SSy’] = Σ(y − ȳ)²
   The sum of squared deviations of y from its mean, where y is Weight (column F) and ȳ is mean weight (cell F18).
3. SPxy = Σ(x − x̄)(y − ȳ)
   The ‘sum of products’ of the deviations of x with y. Dividing this by (n − 1) gives the ‘covariance’.
4. Slope: b = SPxy / SSx
   Gradient of the regression line. A horizontal line has b = 0. A positive gradient has b > 0, while negative has b < 0.
5. Intercept: a = ȳ − b·x̄
   Calculated knowing the regression line passes through (x̄, ȳ).
6. SS(Explained) = Σ(ŷ − ȳ)²
   Explained sum of squared deviations, where ŷ = a + bx. This is the magnitude of the predicted deviation from ȳ.
7. d.f.(Explained) = 2 − 1 = 1
   We have two pieces of information (a and b) and we need one piece (ȳ) to calculate the explained variation.
8. MS(Explained) = SS(Explained) / d.f.(Explained)
   Mean square explained variation. The variance measured as variability per degree of freedom.
9. SS(Error) = Σ(y − ŷ)²
   Unexplained sum of squared deviations, where ŷ = a + bx. This is the magnitude of deviation from the predicted ŷ.
10. d.f.(Error) = n − 2
    We have n pieces of information (the values of y) and we need two pieces (a and b) to calculate the error variation.
11. MS(Error) = SS(Error) / d.f.(Error)
    Mean square unexplained (residual error) variation. The variance measured as variability per degree of freedom.
12. F = MS(Explained) / MS(Error)
    The ratio of explained to unexplained variances, to be compared against tables of the F-distribution with 1 and n − 2 degrees of freedom.
13. R² = SS(Explained) / SS(Total)
    ‘Coefficient of determination’ (often written r²). The proportion of explained variation. If R² = 1, all y lie on a regression line for which b ≠ 0; if R² = 0 then b = 0.
14. R = SPxy / √(SSx·SSy)
    ‘Pearson Product Moment Correlation Coefficient’, r. Equal in magnitude to the square root of the coefficient of determination. Negative R means y tends to decrease with increasing x.
Other terms in the R output:
‘t’-values are Student’s-t tests for departures of the intercept from zero, and of the slope of
Weight with Age from zero. Note that the value of the Student’s-t test of the slope is equal to the
square-root of the value of F from the Analysis of Variance, and both significances are identical.
This is because both tests are accomplishing exactly the same task.
‘Residual standard error’ is the square-root of the variance term given by MS(error).
‘Multiple R-squared’ is the coefficient of determination.
‘Adjusted R-squared’ is an adjusted coefficient of determination that is uninfluenced by the
number of d.f.
The regression analysis on pages 24 and 25 works by partitioning the total variation in the
response variable into explained and unexplained parts. The total variation is obtained from
summing all the squared deviations of each weight value from the mean weight. The long arrow
on the graph on page 25 illustrates the portion of total variation contributed by just one
observation. The analysis will partition the total variation into its two components, illustrated by
the shorter arrows on the graph. One component is predicted by the regression line:
SS(Explained), while the other is the unexplained variation around the line: SS(Error). The
analysis will then calculate the average squared deviations of these two components, in order
finally to get from their ratio: MS(Explained) / MS(Error), the F-value with which to test the
significance of the regression.
The analysis proceeds in steps. First we find the regression line that will estimate values of y for
each of our values of x. With these predicted values, ŷ, we will then be able to sum their squared
deviations from ȳ in order to get the explained sum of squares: SS(Explained).
Steps 1 to 5 of the table on page 26. To find the regression line we must find values for two new
parameters: the slope of the line, b, and its intercept with the y-axis, a.
The slope b is calculated from the sum of products, SPxy = Σ(x − x̄)(y − ȳ), divided by the sum
of squared deviations in x, SSx = Σ(x − x̄)². The sum of products on the numerator tells us about
the covariance of y with x. It gives the slope a positive value if the coordinates for each data
point (x, y) tend to be either both larger than their respective means (x̄, ȳ) or both smaller. The
slope will have a negative value if in general x < x̄ when y > ȳ, and vice versa. This formula for
b also means that the gradient of the slope will have a magnitude of one if, on average, each
deviation |y − ȳ| has the same magnitude as each corresponding deviation |x − x̄|. If the deviations
in y are relatively greater than those in x, then the slope will be steeper than 1. The Excel sheet on
page 25 shows that the regression line on the graph has a gradient of –18.217, signifying that y is
predicted to decrease as x increases and that each decrease in y is predicted to be some 18 times
the corresponding increase in x.
The intercept a is calculated from a = ȳ − b·x̄. This is simply a rearrangement of the equation
for a straight line: y = a + bx. In this case we have known values for the two variables y and x, in
their respective sample means ȳ and x̄, and since we have just calculated b, we can now find the
unknown a.
With values for a and b, we have all the information we need to draw the regression line on the
graph. Excel can do this for us if we request ‘Add Trendline...’ from the ‘Chart’ menu. The result
is shown on the graph on page 25, and it accords with the equation: the line appears to intercept
the y-axis somewhere around 400 g, and the calculated a tells us it is exactly at y = 420.576 g.
Steps 6 to 8. With the two parameters b and a we can predict Weight, ŷ, for any given value of
Age, x. For each observed x we now calculate (ŷ − ȳ)² (column L of the Excel sheet) and sum
them to get the explained sum of squares: SS(Explained).
Steps 9 to 12. Finally, we need the unexplained sums of squares, which we get from the squared
deviation of each y from its predicted ŷ. The sum of all these (y − ŷ)² (in column N of the Excel
sheet) is then the SS(Error). Now we calculate the mean squares: MS(Explained) and MS(Error),
and the F-statistic, in just the same way as for any other Analysis of Variance.
Steps 13 to 14. There remains one final parameter to calculate: the proportion of explained
variance, which is simply SS(Explained) / SS(Total). We call this fraction the ‘coefficient of
determination’ r2. Its square root is called the ‘Pearson product-moment correlation coefficient’
r. Step 14 of the Table on page 26 shows how r is calculated directly, which results in it having a
positive or negative value according to whether the regression is positive or negative.
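The whole sequence of steps 1 to 14 can be reproduced in a few lines of R, which is a useful check on any spreadsheet calculation. The sketch below uses hypothetical x and y values (the badger data are not reproduced here), and confirms the hand calculations against R's own `lm` output:

```r
# Sketch: computing the regression terms of the Table by hand, as the
# Excel sheet does (hypothetical x and y; badger data not reproduced here).
set.seed(3)
x <- 1:12
y <- 420 - 18 * x + rnorm(12, sd = 100)
n <- length(y)

SSx   <- sum((x - mean(x))^2)                   # step 1
SSy   <- sum((y - mean(y))^2)                   # step 2: SS(Total)
SPxy  <- sum((x - mean(x)) * (y - mean(y)))     # step 3
b     <- SPxy / SSx                             # step 4: slope
a     <- mean(y) - b * mean(x)                  # step 5: intercept
yhat  <- a + b * x                              # predicted values
SSexp <- sum((yhat - mean(y))^2)                # step 6
SSerr <- sum((y - yhat)^2)                      # step 9
Fval  <- (SSexp / 1) / (SSerr / (n - 2))        # steps 8, 11, 12
r2    <- SSexp / SSy                            # step 13
r     <- SPxy / sqrt(SSx * SSy)                 # step 14

# The hand calculations agree with R's lm():
m <- lm(y ~ x)
all.equal(unname(coef(m)), c(a, b))
all.equal(anova(m)[["F value"]][1], Fval)
```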
Practical: Two-way Analysis of Variance in R
C. P. Doncaster 29
PRACTICAL: TWO-WAY ANALYSIS OF VARIANCE IN R
Do this analysis in RStudio (refer to the ‘Using RStudio – Help Guide’ on Blackboard). Prepare a
short report to the pharmaceutical company that makes the drug Ritalin, evaluating the utility of
their product (1 side A4). Divide your report into sections: an Introduction to explain the interest
in doing the test; Experimental Design and Analysis outlined briefly; Results, including the
ANOVA table showing Sums of Squares and Mean Squares etc, with an interpretation of the
analysis in the form: “the effect of the drug depended / did not depend on the condition of the
subject (F = #.##; d.f. = ##, ##; P = #.##) ... the main effect of treatment… etc.” Interpret the main
effects after the interaction. Include a fully annotated ‘interactions plot’. Finish with a short
paragraph of Conclusions about appropriate use of the drug.
Two-way Analysis of Variance
In the previous class practical you conducted a one-way Analysis of Variance. ‘One-way’ meant
that you were looking for differences between mean treatment effects for a single independent
factor (pesticide). Sometimes we are interested in responses to more than one independent factor,
and then it is possible to conduct an Analysis of Variance with two or more main effects. The
example below takes you through a two-way Analysis of Variance that you can perform for
yourself in R. It illustrates how analysis of two independent variables can yield informative
inferences. You may find that the output you get is easier to interpret after reading the
accompanying lecture notes on two-way Analysis of Variance.
Rationale
The drug Ritalin was designed to calm hyperactive children, but hyperactivity is a difficult
condition to diagnose, so it is important to know what effect Ritalin has on non-hyperactive
children. The following medical trial tested two groups of children, one non-hyperactive and the
other hyperactive. Each group was randomly divided with one half receiving Ritalin in tablet form,
and the other half a placebo (a salt tablet with no physiological effect). The following activity
responses were recorded on the four samples each of 4 children:
TREATMENT
Placebo Ritalin
CONDITION Non-hyperactive 50, 45, 55, 52 67, 60, 58, 65
Hyperactive 70, 72, 68, 75 51, 57, 48, 55
In this experimental design, the two independent variables are CONDITION (non-hyperactive or
hyperactive) and TREATMENT (placebo or Ritalin). Each CONDITION is tested with each level
of TREATMENT on replicate subjects. A design of this sort is called a ‘factorial design’ and it
allows us to test for a possible interaction between the two factors in their effects on the response
variable. Here the interaction we are seeking is whether the effect of Ritalin on activity depends on
the condition of the child. This could be a good thing, if for example the drug only influences
hyperactive children, or it could provide cautionary information, if the drug is found to have a
more pronounced effect on non-hyperactive than hyperactive children.
Analysis with R
Enter these data into a data frame from a .csv file (command line shown in the two-way
ANOVA lecture) or a .txt file (command line shown in the regression lecture). The data frame
should have 16 rows, one for each score labelled with its combination of treatment-by-condition:
Treatment Condition Activity
Placebo Nonhyp 50
Placebo Nonhyp 45
Placebo Nonhyp 55
Placebo Nonhyp 52
Ritalin Nonhyp 67
Ritalin Nonhyp 60
: : :
Then use the same R commands as for the speed-reading analysis on page 16 to run the analysis
and produce an interaction plot. This requires that you specify the response variable and
explanatory factors in an ANOVA model of the form: ‘response ~ factor_1*factor_2’, meaning:
‘variation in the response is explained by the additive effects of factors 1 and 2 and by their
interaction’. You could equally spell out the model without using the ‘*’ shorthand: ‘response ~
factor_1 + factor_2 + factor_1:factor_2’. Both expressions give identical results. In this case, the
model you are going to test with Analysis of Variance is that activity is influenced by treatment
and by the child’s condition, and by the interaction of treatment with condition. The model tests
these explained sources of variation in the response against unmeasured ‘residual’ variation.
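As a minimal sketch of what those commands look like (here the data frame and model are built inline, and the object names `ritalin` and `model.2way` are only suggestions, not part of the assignment):

```r
# Sketch of the commands, assuming a data frame with columns
# Treatment, Condition and Activity laid out as in the table above.
ritalin <- data.frame(
  Treatment = rep(c("Placebo", "Ritalin"), each = 8),
  Condition = rep(rep(c("Nonhyp", "Hyper"), each = 4), 2),
  Activity  = c(50, 45, 55, 52,  70, 72, 68, 75,   # Placebo: Nonhyp, Hyper
                67, 60, 58, 65,  51, 57, 48, 55))  # Ritalin: Nonhyp, Hyper

model.2way <- aov(Activity ~ Treatment * Condition, data = ritalin)
summary(model.2way)   # F-tests for the two main effects and the interaction
interaction.plot(ritalin$Treatment, ritalin$Condition, ritalin$Activity)
```

The `summary` table gives 1 d.f. to each main effect and to the interaction, leaving 12 residual d.f. from the 16 observations.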
Save the interaction plot and copy it into your report on the analysis.
Now check the residuals, by nesting the ‘aov(…)’ command within a ‘plot(…)’ command
(see example on page 14). The first two graphs suffice to show homogeneous variances – which is
the most important consideration, though with a rather flat distribution of residuals.
As with everything in R, if you are not sure how to do something, try it and see – you can’t break
the package! Save your commands in a ‘script’ file, so that you can use them again in the future,
and refer to them to see how you did things in the past. Do search the web for help, as usually
someone will have posted an answer to someone else’s similar problem. For example, if you want
to know more about interpreting the Normal Q-Q plot of residuals, try Googling ‘normal Q-Q
plots in R showing skew’.
Peruse the results of the ANOVA, noting that a separate F-value and associated p-value have
been produced for each of the main effects Treatment and Condition, and for the Treatment-by-
Condition interaction. Which effects are significant? How do we interpret these results? Refer to
the lecture notes on two-way ANOVA to be sure which d.f. apply to each F-value.
Interpretation
The analysis reveals something very interesting from a medical point of view, though it needs the
interaction plot to understand it. This plot illustrates qualitatively what the ANOVA described
statistically, and it unmasks the full effect of the drug… Hyperactive children are less active on
average with the drug than with the placebo. That is to be expected, but Non-hyperactive children
are more active on average with the drug than with the placebo. This is the significant interaction
effect that you will have obtained in the ANOVA. For each Treatment level, the point midway
between the two condition-level means indicates that Treatment-level mean after pooling levels of
Condition. These midway points are at an Activity score of about 58 for both Placebo and Ritalin,
which explains the non-significant main effect of Treatment. Does a non-significant main-effect of
Treatment indicate that the drug is ineffectual? No! The significant interaction means that the full
effects of the drug become apparent only when the condition of the children is taken into account.
Ritalin does affect activity, but although it subdues hyperactive children it raises the activity of non-
hyperactive children. This is one reason why it is a controversial drug that must be prescribed only
to hyperactive children. The take-home message for interpreting two-way ANOVA is to read the
ANOVA table from the bottom up, because the main effects only make sense in the light of the
interaction.
Lecture notes: Correlation and transformations
C. P. Doncaster 31
LECTURE: CORRELATION AND TRANSFORMATIONS
Review of ANOVA procedures in regression
We have seen how the significance of a simple regression line is calculated by one-way Analysis
of Variance. Our example used the statistical model: Weight = Age + ε. We evaluated how good
a predictor Age is in this model by partitioning the total observed variation in weight (measured
as the sum of squared deviations from the sample mean: Σ[y − ȳ]²) into a portion explained by
the line of best fit of Weight on Age (SS[Age] = Σ[ŷ − ȳ]²), and an unexplained portion
(SS[ε] = Σ[y − ŷ]²). We could then work out our F-statistic from the ratio of average explained
variation to average unexplained variation: F1,n-2 = MS[Age] / MS[ε].
Just as you can expand ANOVA from a one-way to a two-way analysis by introducing a second
factor (as we did in Lecture 2 and Practical 2 in this series), so you can expand regression from
simple- to multiple-regression, by introducing a second factor.
This second factor may be categorical, in which case you can plot the response variable against
the continuous factor, and calculate one regression line for each level of the categorical factor. If
the regression lines are not horizontal then you may have a significant continuous factor, and if
the lines do not coincide then you may have a significant categorical factor. If the regression lines
have different slopes, then you may have a significant interaction effect. The interaction plots
shown on p. 20 of this booklet illustrate some of the range of outcomes you could get - just think
of the x-axis as representing some continuous variable instead of the categorical factor ‘Sex’ (for
example ‘Age’), and the lines joining sample means then become regression lines for each level
of the categorical factor (in this case, ‘System’).
If the second factor is continuous rather than categorical, then you will need to illustrate these
data in a 3-dimensional graph, with the response on the vertical axis, and the two continuous
factors on orthogonal (i.e. ‘at right-angles’) horizontal axes. The best-fit model will then be a
plane through the data, as opposed to lines through the data.
With these more complicated models, the Analysis of Variance should be done with a balanced
design, so the same number of observations are recorded at each combination of factor levels.
The design can become unbalanced by missing data, or by using explanatory factors that are
correlated with each other and therefore non-orthogonal. For example if variation in body height
is modelled against right-leg length and against left-leg length, the second-entered explanatory
variable will appear to have no power to explain height while the first-entered explanatory
variable may appear highly significant. The problem is that the two variables are correlated with
each other, so the design is unbalanced by having missing data on short-left and long-right legs
and on short-right and long-left legs. In effect, the variables are not orthogonal to each other.
Having accounted for the variation explained by the first-entered factor there is then necessarily
little variation left over for explanation by the second-entered factor. The true relationship would
be better analysed with a one-factor regression on a single composite explanatory variable of ‘leg
length’ that uses the average of left and right lengths. For more on this topic see Doncaster &
Davey (2007 Analysis of Variance and Covariance, pages 237-242).
Correlation
For some types of investigation of covariance between continuous variables we may wish to seek
correlation without making predictions about how one variable is influenced by the other. For
example, if we have measures of body Volume for each Weight, we may not have an a priori
reason for knowing whether Volume determines Weight, or Weight determines Volume.
For the analysis of Weight and Age, in contrast, Age was clearly an explanatory (predictor, x)
variable and Weight the response (y) variable. The analysis of those two factors was predictive
because Age was hypothesised to influence Weight, but Weight could not under any
circumstances influence Age. Wherever we have employed Analysis of Variance up to now, it
has been used to explain variation in a response variable in terms of a predicted effect.
For the analysis of Weight and Volume we may not have a priori reasons for classifying one
variable as ‘effect’ and the other as ‘response’. We then restrict ourselves to seeking an inter-
dependency, or an association, between the two continuous variables. We can test for association
with the correlation coefficient r, because its value does not depend on which variable is on
which axis. The strength of correlation can still be tested with the Student’s-t or the Analysis of
Variance, as on page 24, because both these tests remain unchanged regardless of which variable
is x and which y.
The equation of the regression line does change, however, if we swap the axes. We can see what
happens to it by manipulating the regression we did of Weight with Age (pp. 24-25 and practical
3 - you can try this with the Excel sheet that you create for the practical). The equation for the
regression on page 25 was:
Weight = 420.6 – 18.2 Age
Swapping the axes yields a new regression equation: Age = 11.2 – 0.015 Weight,
which can be rearranged in terms of weight to give:
Weight = 724.5 – 64.9 Age
These two equations give entirely different predictions for weight change with age, and only the
first one is correct. The second equation illustrates the kind of error that you might get if you
used regression without respecting the requirement always to put the response variable on the
vertical axis and the predictor variable on the horizontal axis. The first equation predicts
correctly that cubs have an average weight at birth of 420 g (when Age = 0) and an average loss
rate of 18 g day-1, whereas the second equation erroneously predicts an average birth weight 1.7
times greater, and an average rate of weight loss 3.5 times greater, than these figures.
If you are in doubt about whether one of your variables is a true predictor, then do not put a line
of best fit through the plot. Just stick to the simple correlation coefficient r for evaluating the
association between the two variables. Use r instead of r2 because the sign of r provides valuable
information about whether the variables are positively or negatively correlated with each other.
Remember, however, that the correlation coefficient does assume the two variables have a linear
relation to each other. A perfect linear relation will return a value of |r| = 1.0, but a perfect curved
relation will return a value of |r| < 1.0. If your variables are not related to each other in some
direct proportion, then you may need to transform one or other axis in order to linearize the
relation (see p. 35).
The graphs below illustrate some types of correlation (from Fowler et al. 1998 Practical
Statistics for Field Biology. Wiley). Note that the last graph, of perfect rank correlation, would
give Spearman’s rank correlation coefficient rs = 1.0, which is clearly an over-estimate of the
true level of correlation. The non-parametric Spearman’s coefficient is simply Pearson’s
coefficient calculated on the ranks. Use the parametric Pearson’s in preference to Spearman’s
wherever you can meet its assumptions.
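Both points are easy to verify in R. A perfect monotonic curve has |r| < 1 for Pearson's coefficient on the raw data, yet rs = 1 for Spearman's, and Spearman's value is exactly Pearson's coefficient computed on the ranks:

```r
# Sketch: Spearman's coefficient is Pearson's coefficient on the ranks,
# and a perfect monotonic curve gives rs = 1 even though r < 1.
x <- 1:10
y <- x^3                            # perfect curved (monotonic) relation
rs <- cor(x, y, method = "spearman")
rp <- cor(rank(x), rank(y))         # Pearson on the ranks
c(rs, rp)                           # both equal 1
cor(x, y)                           # Pearson on raw data: less than 1
```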
Transforming data to meet the assumptions of parametric Analysis of Variance
Analysis of variance has proved to be a powerful and versatile technique for analysing any kind
of response variable showing some variation around a mean value. We can use ANOVA to
explain this variation in terms of two or more levels of a factor (one-way ANOVA), or in terms
of the interacting levels of two or more factors (two-way ANOVA or multi-way ANOVA), or in
terms of one or more continuous factors (simple regression or multiple regression). We can also
use ANOVA to test the evidence for a correlation between two continuous variables.
Wherever you have observations of a continuous variable that you wish to explain in terms of
one or more factors, consider using Analysis of Variance before you think of using non-
parametric statistics. Parametric tests are more powerful because they use the actual data rather
than ranks, and for many types of data there simply is no appropriate non-parametric test (e.g.
regression, two-way analyses with categorical and continuous factors, interactions etc).
Having decided to use parametric Analysis of Variance, you must be aware of its underlying
assumptions (introduced on p. 6 of this booklet). If you also know the ways in which these are
likely to be violated, then you can pre-empt many potential difficulties by applying appropriate
transformations to the data. These are the assumptions:
1. Random sampling, so that your observations are a true reflection of the population from
which you took them.
Is it a problem? This is a basic assumption of all statistical analyses, parametric or
non-parametric. Whether or not it is met depends on sampling strategy. Solution: If
your data do not meet it, then you will have to resample your data.
2. Independent observations, so that the value of one data point cannot be predicted from the
value of another.
Is it a problem? This is a basic assumption of all statistical analyses, parametric or
non-parametric, and it depends on sampling strategy. Solution: If your data do not
meet it, then either resample your data or ‘factor out’ the non-independence by
adding a new explanatory factor (e.g. add the categorical factor ‘Subject’ if you have
repeated measures on each subject).
3. Homogeneity of variance around a regression line (for a covariate), or of variances around
sample means (for a factor), because the ANOVA uses pooled error variances to seek
differences between means, and it does not seek differences between variances.
Is it a problem? Depends on the type of observations. Often violated by observations
that cannot take negative values, such as weight, length, volume, counts etc, because
these are likely to have a variance that increases with the mean. Solution: log-
transformation of response (which for regression and correlation may then require
log-transformation of x also, to reinstate linearity).
4. Normal distribution of residual variation around a regression or around sample means,
because this distribution is described by just two parameters: the mean and variance, which
are the two employed by ANOVA (a skewed distribution needs to be described with a third
parameter, not accounted for in ANOVA).
Is it a problem? Generally less than heterogeneity, and depends on the type of
observations. May be violated by observations in the form of proportions or
percentages, because they are constrained to lie between zero and 1 or 100, whereas
the normal distribution has tails out to plus and minus infinity. Also violated by
observations in the form of counts, which follow a Poisson rather than a normal
distribution. Solution: Arcsine-root transformation of proportions, or logistic
regression on proportions (which assumes binomial rather than normal errors).
Square-root transformation of counts, or use a Generalised Linear Model (the ‘glm’
command in R) which can assume Poisson errors.
5. For regression and correlation: Linear relations between continuous variables, because the
explained and residual components of variation are measured against a predicted line
defined by just two parameters, the intercept a and slope b. A non-linear relation would
need describing with additional parameters, not accounted for in the regression analysis.
Is it a problem? Depends on the type of observations. Most likely to be violated by
relationships with an inherently non-linear biology. Solution: reinstate linearity with
an appropriate transformation to one or both axes – see four examples below.
Consider fitting a polynomial only if it makes sense biologically to model the
response with additive powers of the predictor.
If any of assumptions 3-5 are not met, we should not immediately abandon the use of parametric
statistics. The command ‘glm’ will run a Generalised Linear Model that can accommodate Analysis
of Variance on data with inherently non-normal distributions, such as proportions (which have a
binomial distribution), or frequencies of rare events (with a Poisson distribution and variance
increasing with the mean response). Commands of the sort aov(Y ~ A) or lm(Y ~ A), which
we have been using up to now, have an equivalent in glm: anova(glm(Y ~ A, family =
gaussian(link = identity)), test = "F"). You can replace gaussian (for a normal
distribution) with poisson or binomial, as dictated by the type of data. This website shows a
worked example, for its model 5.9:
http://www.southampton.ac.uk/~cpd/anovas/datasets/ANOVA in R.htm
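For gaussian errors with an identity link, the glm form really is equivalent to the familiar ANOVA, as this sketch with simulated data confirms:

```r
# Sketch (hypothetical data): with gaussian errors and an identity link,
# the glm analysis reproduces the ordinary ANOVA F-test.
set.seed(5)
A <- gl(3, 8)                        # factor with 3 levels, 8 replicates each
Y <- c(rnorm(8, 10), rnorm(8, 12), rnorm(8, 15))

F.aov <- summary(aov(Y ~ A))[[1]][["F value"]][1]
F.glm <- anova(glm(Y ~ A, family = gaussian(link = identity)),
               test = "F")[["F"]][2]
all.equal(F.aov, F.glm)              # TRUE
```

Replacing `gaussian` with `poisson` or `binomial` changes the assumed error structure, which is where glm goes beyond what transformation of the response can achieve.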
An alternative route to meeting the assumptions is by transformation of the response (commonly
with an arcsin-root transformation for proportions, or a square-root transformation for counts, or
a generic Box-Cox transformation). This is less desirable than modelling the error structure with
glm, because the transformation changes the nature of the test question.
For regression analyses in particular, you may have a priori reasons for suspecting a non-linear
relationship of response to predictor. An understanding of the underlying biology will often
suggest an appropriate linearizing transformation. Transformations are not cheating, because they
are planned in advance, and the same conversion is applied to all observations. The idea is to
reduce complexity by converting a non-linear relation to a linear one. Here are some examples:
1. The response may be inherently exponential, for example in population growth over time of
freely self-replicating organisms. A linear regression on ln(population) against time will give a
slope that equals the intrinsic rate of natural increase per capita.
2. Response and predictor may have different dimensions, for example in a weight response to
length (see p. 39), suggesting a power function. Logging both axes will linearize power-function
relationships, and simultaneously deal with associated issues of the variance increasing with the
mean response and skewed residuals.
3. The response may saturate, for example in the response of weight increase to body weight, or
the response of food consumption to food abundance. Linearization is achieved by understanding
the underlying biology: try inverse body weight, and try inverse consumption and abundance.
4. The response may be cyclic, for example in a circadian rhythm. Transformation of the
predictor with circular function (e.g., sin(x) or cos(x)) may linearize the relationship.
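The first of these examples can be sketched in R with simulated exponential growth (the data, starting population and rate here are invented for illustration):

```r
# Sketch (simulated data): log-transforming an exponential growth response
# linearizes it, and the regression slope recovers the per-capita growth rate.
set.seed(6)
time <- 0:20
N    <- 10 * exp(0.3 * time) * exp(rnorm(21, sd = 0.05))  # noisy exponential

m <- lm(log(N) ~ time)    # linear on the log scale
coef(m)[2]                # close to the true intrinsic rate of increase, 0.3
```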
If you resort to non-parametric methods, be aware that they all make assumptions 1 and 2 above.
Also, statistics on ranks (e.g., Spearman’s correlation) require that the ranks meet assumptions 3-
5. Finally, some data may not suit any statistics because they have too little variation (e.g. when
skewed by numerous zero values) or insufficient replication (e.g. data with too many missing
values). In such cases, change your test question to allow sub-sampling from the dataset.
Lecture notes: Fitting statistical models
C. P. Doncaster 37
LECTURE: FITTING STATISTICAL MODELS TO DATA
Statistical packages like R all work by fitting models to data. They require you to use an
appropriate model for the samples and variables under investigation, before they will estimate
parameter values that best fit the data. These pages will help you fit appropriate models to data.
In the first example (A1) below, the model formula is a mathematical relationship (the Poisson
probability function) describing the probability of obtaining exactly 0, 1, 2,... species of insects
per leaf. But the other
examples all use a standard convention for presenting statistical models, which takes the form:
response variable(s) = explanatory variable(s). Here the ‘=’ sign is simply a statement of the
hypothesised relationship between the variables rather than a logical equality. The chosen
statistic will quantify the relationship of the response variable (continuous except in A2a) to the
explanatory variables (which can be continuous: A2b & B1, or divided into samples: A3 & B2).
A. The three principal types of data and statistical models
1. One sample, one variable
For data of this kind, look for a goodness-of-fit of frequencies
E.g. The sample is 50 leaves of Sycamore picked at random; the variable is the number of species
of insect parasites per leaf. This is predicted to follow a random distribution, so the appropriate
model for calculating expected frequencies is the Poisson distribution.

Species per leaf (x)      0      1      2      3     4+   Total
Observed frequencies      3     22     15      6      4      50
Expected frequencies   8.43  15.01  13.36   7.93   5.28      50
(O − E)² / E           3.50   3.26   0.20   0.47   0.31    7.73
[Figure: observed and expected frequencies of leaves plotted against number of species per leaf]
H0: Observed distribution is no different to the expected Poisson (i.e. no
interaction between species)
Test statistic: Chi-squared or G-test of goodness of fit
Outcome: χ²₃ = 7.73, p < 0.05
Conclusion: observed numbers of species differ from random expectation. Since the
observed distribution is narrower than expected, the species are more
regularly spaced than random, with one per leaf predominating (indicating
mutual repulsion in competition between the species)
Assumptions: data are nominal (not continuous), frequencies are independent (i.e. 50
independent leaves), no cell with expected value < 5.
For continuous data, use Kolmogorov-Smirnov test.
Model formula: Poisson distribution with mean x̄ = 1.78 species/leaf
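A sketch of the same goodness-of-fit test in R, lumping the 4+ classes so the probabilities sum to one:

```r
obs <- c(3, 22, 15, 6, 4)            # observed frequencies for 0, 1, 2, 3, 4+
p   <- dpois(0:3, lambda = 1.78)     # Poisson probabilities for 0-3 species
p   <- c(p, 1 - sum(p))              # lump 4+ into a single class
expected <- sum(obs) * p             # expected frequencies: 8.43, 15.01, ...
X2  <- sum((obs - expected)^2 / expected)  # chi-squared statistic, 7.73
# df = 5 classes - 1 - 1 parameter estimated from the data (the mean) = 3
pchisq(X2, df = 3, lower.tail = FALSE)
```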
2. One sample, two variables
For data of this kind, look for a dependent relationship (an association) between the variables
(a) Categorical variables
Use a contingency table of frequencies to look for an interaction between the variables
E.g. Sample is 2-year old infants, variables are eye colour and behavioural dominance.
Contingency table              Eye colour
                            Blue   Other   Total
Behaviour   Dominant          13       7      20
            Submissive        22      29      51
Total                         35      36      71
[Figure: bar chart of frequencies for each combination of eye colour (Blue, Other) and
behaviour (Dominant, Submissive)]
H0: Column categories are independent of row categories.
Test statistic: Chi-squared or G-test of independence
Outcome: χ²₁ = 1.942, p = 0.16
Conclusion: there is no detectable interaction of colour with behaviour: behavioural
dominance is not associated with blue eyes
Assumptions: data are truly categorical (frequency in each cell conforms to a Poisson
distribution), frequencies are independent (71 independent subjects, e.g.
no siblings), no cell with expected value < 5, correction for continuity.
For cells with expected values < 5, use Fisher’s exact test.
Model formula: frequency ~ colour + behaviour + colour:behaviour (the test concerns the
colour:behaviour interaction)
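A sketch of the test in R. Note that chisq.test() applies Yates' continuity correction to 2 × 2 tables by default, which is the correction for continuity assumed in the outcome above:

```r
eyes <- matrix(c(13,  7,
                 22, 29),
               nrow = 2, byrow = TRUE,
               dimnames = list(Behaviour = c("Dominant", "Submissive"),
                               Colour    = c("Blue", "Other")))
chisq.test(eyes)    # X-squared = 1.94, df = 1, p-value = 0.16
fisher.test(eyes)   # exact alternative, for when expected values fall below 5
```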
(b) Continuous variables
Plot the response variable on the y-axis against the explanatory variable on the x-axis
E.g. Sample is polar bears; response variable is body weight and explanatory variable is radius
length.
Subject   Body weight (kg)   Radius length (cm)
   1             65                45.0
   2             70                47.5
   3             74                57.0
   4            142                59.5
   5            121                62.0
   6             80                53.0
   7            108                56.0
   8            344                67.5
   9            371                78.0
  10            416                72.0
  11            432                77.0
  12            348                72.0
  13            476                75.0
  14            478                75.0
   :              :                  :
 143              :                  :
H0: Variation in body weight is independent of radius length.
Test statistic: Linear regression on transformed weight and radius length (Ln[Weight]
labelled as a new variable ‘ln.Weight’; ln[Length] labelled ‘ln.Length’)
Outcome: F1,141 = 944.6, p < 0.0001
Conclusion: the regression slope differs from zero; radius length is a precise
predictor of body weight, explaining 87% of the variance in body weight
with the chosen model.
Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of
variances, (iv) normal distribution of errors, (v) linearity.
For continuous variables with no clear functional relationship, use correlation to calculate r.
Model formula: ln.Weight ~ ln.Length
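A sketch of this analysis in R, assuming a data frame bears holding Weight and Length for all 143 subjects (the data frame name is illustrative):

```r
bears$ln.Weight <- log(bears$Weight)   # natural-log transform both variables
bears$ln.Length <- log(bears$Length)
fit <- lm(ln.Weight ~ ln.Length, data = bears)
summary(fit)   # F on 1 and 141 df; r-squared reported as Multiple R-squared
plot(fit)      # residual diagnostics for assumptions (i)-(v)
```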
3. One-way classification of two (or more) samples
For data of this kind, look for a difference between sample means
E.g. Samples are two levels of a feeding regime for shrews: a diet of blow-fly pupae, and a diet
of dung-fly pupae. The response variable is weight (g).
Feeding regime
                    blow-fly diet (g)   dung-fly diet (g)
                            5                   4
                            2                  10
                           10                   7
                            6                   9
                            5                  15
                            8                  12
                            4                   8
                            2                  11
                            7                  13
                           12                  17
                            9                   5
                            3                  10
                                               11
n subjects =               12                  13
Mean =                   6.08               10.15
Standard error =         0.92                1.02
[Figure: body weight (g) plotted by diet (Blowfly vs. Dungfly)]
H0: Feeding regime has no effect on weight (the two samples come from the
same population)
Test statistic: Analysis of Variance (or t-test when just two groups)
Outcome: F1,23 = 8.60, p < 0.01
Conclusion: shrew body weights depend on type of feeding regime
Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of
variances, (iv) normal distribution of errors.
For data with repeated measures on subjects (assumption (ii)), use repeated measures ANOVA;
for data that violate assumptions (iii)-(iv) use prior transformations, or use the non-parametric
Kruskal-Wallis test (or Mann-Whitney if you have just two samples).
Model formula: weight ~ regime
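The analysis in R, entering the shrew weights from the table above:

```r
weight <- c(5, 2, 10, 6, 5, 8, 4, 2, 7, 12, 9, 3,           # blow-fly diet
            4, 10, 7, 9, 15, 12, 8, 11, 13, 17, 5, 10, 11)  # dung-fly diet
regime <- factor(rep(c("blowfly", "dungfly"), times = c(12, 13)))
summary(aov(weight ~ regime))   # F on 1 and 23 df
kruskal.test(weight ~ regime)   # non-parametric alternative
```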
B. Selecting and fitting models to data
R offers many alternative commands for Analysis of Variance. The command ‘aov’ will suit
most straightforward analyses with normally distributed residuals. The command ‘glm’ will fit a
Generalized Linear Model that can accommodate Analysis of Variance on data with inherently
non-normal distributions, such as proportions (which have a binomial distribution), or
frequencies of rare events (with a Poisson distribution).
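Minimal sketches of the two commands; the response and factor names here (weight, regime, alive, dead, count) are hypothetical:

```r
fit1 <- aov(weight ~ regime)                   # normally distributed residuals
fit2 <- glm(cbind(alive, dead) ~ regime,       # proportions: binomial errors
            family = binomial)
fit3 <- glm(count ~ regime, family = poisson)  # frequencies of rare events
```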
1. One-way classification of two (or more) samples, two continuous variables
For data of this kind, look for differences between regression slopes
E.g. Samples are male (circles and continuous line) and female (triangles and broken line) polar
bears; response variable is body weight and explanatory variable is radius length.
Subject   Body weight (kg)   Radius length (cm)   Sex
   1             65                45.0             M
   2             70                47.5             F
   3             74                57.0             F
   4            142                59.5             M
   5            121                62.0             F
   :              :                  :              :
 143              :                  :              :
H0: Variation in body weight is independent of radius length and of sex.
Test statistic: Analysis of Variance on ln.Weight with covariate ln.Length (or Generalized
Linear Model for non-normal error structures).
Outcome: ln.Length effect (adjusted for Sex) F1,139 = 1003.66, p < 0.0001
Sex effect (adjusted for ln.Length) F1,139 = 3.57, p = 0.06
Sex-by-ln.Length interaction F1,139 = 7.24, p = 0.008
Conclusion: the two regression lines have different slopes, so the effect of radius length
on weight differs by sex
Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of
variances, (iv) normal distribution of errors, (v) linearity.
Model formula: ln.Weight ~ ln.Length + Sex + ln.Length:Sex
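A sketch of this analysis of covariance in R, again assuming a data frame bears, now with a Sex column:

```r
bears$ln.Weight <- log(bears$Weight)
bears$ln.Length <- log(bears$Length)
fit <- aov(ln.Weight ~ ln.Length * Sex, data = bears)
summary(fit)   # ln.Length, Sex and ln.Length:Sex terms, each tested on 1 df
# aov() reports sequential sums of squares; to obtain each main effect
# adjusted for the other, as quoted above, refit with the terms in the
# other order (ln.Weight ~ Sex * ln.Length) and compare.
```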
2. Two-way classification of samples
For data of this kind, look for two-way differences between means
E.g. Shrew samples are classified by feeding regime and sex; response variable is body weight as
in Analysis of Variance above.
Feeding regime
                    blow-fly diet          dung-fly diet
Sex   females       2  2  9  4  5  5       10  11  5  13  15  17  11
      males         6  7  8  3  10  12      4  12  7  8  9  10
[Figure: interaction plot of mean body weight (g) against diet (Blowfly, Dungfly), with
separate lines for females and males]
H0: The effect of regime on weight is not affected by sex
Test statistic: Analysis of Variance (or Generalized Linear Model for non-normal error
structures).
Outcome: sex effect (adjusted for regime) F1,21 = 0.01, p = 0.933
regime effect (adjusted for sex) F1,21 = 9.68, p < 0.005
regime:sex interaction effect F1,21 = 6.68, p < 0.05
Conclusion: the effect of regime on weight depends on sex, with females doing better
on dungflies and males on blowflies
Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of
variances, (iv) normal distribution of errors.
Model formula: weight ~ regime + sex + regime:sex
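The analysis in R, entering the data from the two-way table above:

```r
weight <- c(2, 2, 9, 4, 5, 5,            # females, blow-fly
            10, 11, 5, 13, 15, 17, 11,   # females, dung-fly
            6, 7, 8, 3, 10, 12,          # males, blow-fly
            4, 12, 7, 8, 9, 10)          # males, dung-fly
regime <- factor(rep(c("blowfly", "dungfly", "blowfly", "dungfly"),
                     times = c(6, 7, 6, 6)))
sex    <- factor(rep(c("F", "M"), times = c(13, 12)))
summary(aov(weight ~ regime * sex))
# Cell sizes are unequal (6, 7, 6, 6), so aov's sequential sums of squares
# depend on term order; refit as weight ~ sex * regime to obtain each main
# effect adjusted for the other, as reported above.
```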
Practical: Regression and correlation
C. P. Doncaster 43
PRACTICAL: CALCULATING REGRESSION AND CORRELATION
In this practical you will do ‘by hand’ the linear regression shown on pages 24-25 of this booklet.
To save tedious calculations, however, you will put Excel to work by asking it to do all of the
arithmetic for you. This still means that you will need to understand how the regression analysis
works, so refer to pages 26-27 as you follow the steps through on the computer. Look back
through the notes for lectures 3 and 4 to appreciate the underlying logic of the analysis.
First run the practical in R, using the commands on page 24 of the booklet. Then open up Excel.
On a fresh spreadsheet, type in the data shown in rows 4 to 15 of columns B and F in the Excel
worksheet illustrated on page 25 of this booklet. Don’t type in any more data than just these two
columns. Excel will do the rest! But you have to tell it what to do...
Your task is now to use Excel formulae to obtain all the figures as they appear in the other cells
and columns. Your objective is to replicate the entire sheet shown on page 25 without typing in
any more numbers. When you have done this, save the result, as you may wish to use it again.
In order to use Excel formulae, you must type an ‘=’ sign in a cell where you wish to calculate a
number from data in other cells. For example, to obtain a value in cell B19 for the mean age, type
in cell B19:
‘=AVERAGE(B4:B15)’
Likewise, to obtain a value in cell F19 for the mean weight, type in cell F19:
‘=AVERAGE(F4:F15)’
Now to obtain a value in cell H4 for the squared deviation of the first Weight value (in cell F4)
from its sample mean (which you have just calculated in F19), type in cell H4:
‘=(F4-$F$19)^2’
Having entered this command, you can repeat it down through the whole of column H from H4
to H15 by clicking on the bottom right corner of the cell and dragging down to H15. Look at the
formulae you have created to check that they are giving you squared deviations of each weight
value from the sample mean. You should now see in column H the full set of squared deviations
of Weights from their sample mean. Now get the sum of squared deviations: SS(Total) in cell
H17 by typing ‘=SUM(H4:H15)’
Likewise, to obtain a value in cell J4 for the product of the first Weight deviation with its
corresponding Age deviation, type in J4:
‘=(B4-$B$19)*(F4-$F$19)’
Then drag that formula down to J15 in order to get all the products. Finally, get the sum of the
products of deviations in cell J17 by typing ‘=SUM(J4:J15)’.
Do a similar operation for column D, then calculate the parameters for the slope and intercept of
the line. Use these parameter constants to obtain for each x a predicted y = a + bx, in order to
then calculate the values in columns L and N. Finally calculate the explained and error SS and
MS, and the F-value. Check that your sheet matches the one on page 25. You can then ‘play’
with the data to see what difference it makes to the significance of the relationship if you change
just one of the values. For example, change the Weight value in cell F12 from 431 g to 231 g. Is
the relationship now significant? Has the magnitude of the correlation coefficient r got closer to
unity? Playing with test data in this way will help you to understand how the statistics work. But
don’t try this with real data! If you had actually observed a Weight of 431 g, then you would have
to work with that. If the outcome is a non-significant relationship, then your best explanation is
no detectable relationship (failure to reject H0), given the assumptions of the analysis.
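The spreadsheet arithmetic above can be mirrored in R, assuming vectors age and weight holding the twelve data values from columns B and F:

```r
ss.total <- sum((weight - mean(weight))^2)                    # cell H17
sp.xy    <- sum((age - mean(age)) * (weight - mean(weight)))  # cell J17
ss.x     <- sum((age - mean(age))^2)                          # column D sum
b <- sp.xy / ss.x                    # slope
a <- mean(weight) - b * mean(age)    # intercept
pred     <- a + b * age              # predicted y for each x
ss.model <- sum((pred - mean(weight))^2)   # explained SS
ss.error <- ss.total - ss.model            # error SS
f.ratio  <- (ss.model / 1) / (ss.error / (length(age) - 2))  # df = 1 and n - 2
anova(lm(weight ~ age))   # the same analysis in one line, for checking
```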
Appendix 1: Terminology of Analysis of Variance
C. P. Doncaster 45
APPENDIX 1: TERMINOLOGY OF ANALYSIS OF VARIANCE
Once you have familiarised yourself with the terminology of Analysis of Variance you will find it
easier to grasp many of the parametric techniques that you read about in statistics books. Some of
the terms described below may be referred to by one of many names, as indicated in the left hand
column. They are illustrated here with a simple example of statistical analysis, in which a biologist
wishes to explain variation in the body weights of a sample of people according to different
variables such as their height, sex and nationality. More detailed descriptions of the terms shown
below, as well as many others that go beyond your immediate needs, can be found in the Lexicon
of Statistical Modelling (http://www.geodata.soton.ac.uk/biology/lexstats.html).
Term Description
1. Variable A property that varies in a measurable way between subjects in a sample.
2. Response variable,
Dependent variable,
Y
The variable of interest, usually measured on a continuous scale (e.g.
weight: what causes variation in weight?). If these measurements are free to
vary in response to the explanatory variable(s), statistical analysis will reveal
the explanatory power of the hypothesised source(s) of variation.
3. Explanatory variable,
Independent variable,
Predictor variable,
Factor,
Effect,
X
The non-random measurements or observations (e.g. treatments of a ‘drug’
factor, fixed by experimental design), which are hypothesised in a statistical
model to have predictive power over the response variable. This hypothesis is
tested by calculating sums of squares and looking for a variation in Y between
levels of X that exceeds the variation within levels. An explanatory variable
can be categorical (e.g. sex, with 2 levels of male and female), or continuous
(e.g. height with a continuum of possibilities). The explanatory variable is
assumed to be ‘independent’ in the sense of being independent of the response
variable: i.e. weight can vary with height, but height is independent of weight.
The values of X are assumed to be measured precisely, without error,
permitting an accurate estimate of their influence on Y.
4. Variates,
Replicates,
Observations,
Scores,
Data points
The replicate observations of the response variable (Yi)
measured at each level of the explanatory variable. These are the data points,
each usually obtained from a different subject to ensure that the sample size
reflects n independent replicates (i.e. it is not inflated by non-independent
data: ‘pseudoreplication’).
5. Sample,
Treatment
The collection of observations measured at a level of X (e.g. body weights
from one sample of males and another of females to test the effect of Sex on
Weight; or crop Yield tested with two Pesticide treatments). If X is continuous
the sample comprises all measures of Y on X (e.g. Weight on Height).
6. Sum of squares The squared distance between each data point, , and the sample mean,Y,
summed for all n data points. The squared deviations measure variation in a
form which can be partitioned into different components that sum to give the
total variation (e.g. the component of variation between samples and the
component of variation within samples).
7. Variance The variance in a normally distributed population is described by the average
of n squared deviations from the mean. Variance usually refers to a sample,
however, in which case it is calculated as the sum of squares divided by n-1
rather than n. Its positive root is then the standard deviation, SD, which
describes the dispersion of normally distributed variates (e.g. 95% lying within
1.96 standard deviations of the mean when n is large).
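In R, var() divides the sum of squares by n − 1, and sd() is its positive root (the sample values here are hypothetical):

```r
y <- c(5, 2, 10, 6, 5, 8)               # a small illustrative sample
sum((y - mean(y))^2) / (length(y) - 1)  # sum of squares / (n - 1): 7.6
var(y)                                  # the same value
sd(y)                                   # its positive root, the SD
```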
8. Statistical model,
Y = X + ε
A statement of the hypothesised relationship in the sampled population
between the response variable and the predictor variable. A simple model
would be: Weight = Sex + ε. The ‘=’ does not signify a literal equality, but a
statistical dependency. So the statistical analysis is going to test the hypothesis
that variation in the response variable on the left of the equals sign (Weight) is
explained or predicted by the factor on the right (Sex), in addition to a
component of random variation (the error term ε, ‘epsilon’). An Analysis of
Variance will test whether significantly more of the variation in Weight falls
between the categories of ‘male’ and ‘female’, and so is explained by the
independent variable ‘Sex’ than lies within each category (the random
variation ε). The error term is often dropped from the model description
though it is always present in the model structure, as the random variation
against which to calibrate the variation between levels of X in the F-ratio.
9. Null hypothesis, H0
While a statistical model proposes a hypothesis, e.g., that Y depends on X, the
statistical analysis can only seek to reject a null hypothesis: that Y does not
vary with X in the population of interest. This is because it is always easier to
find out how different things are than to know how much they are the same, so
the statistician’s easiest objective is to establish the probability of a deviation
away from random expectation rather than towards any particular alternative.
Thus does science in general proceed cautiously by a process of refutation. If
the analysis reveals a sufficiently small probability that the null hypothesis is
true, then we can reject it and state that Y evidently depends on X in some way.
10. One-way ANOVA,
Y = X
An Analysis of Variance (ANOVA) to test the model hypothesis that variation
in the response variable Y can be partitioned into the different levels of a
single explanatory variable X (e.g. Weight = Sex). If X is a continuous
variable, then the analysis is equivalent to a linear regression, which tests for
evidence of a slope in the best fit line describing change of Y with X (e.g.
Weight with Height).
11. Two-way ANOVA,
Y = X1 + X2 + X1X2
Test of the hypothesis that variation in Y can be explained by one or both
variables X1 and X2. If X1 and X2 are categorical and Y has been measured
only once in each combination of levels of X1 and X2, then the interaction
effect X1X2 cannot be estimated. Otherwise a significant interaction term
means that the effect of X1 is modulated by X2 (e.g. the effect of Sex, X1, on
Weight, Y, depends on Nationality, X2). If one of the explanatory variables is
continuous, then the analysis is equivalent to a linear regression with one line
for each level of the categorical variable (e.g. graph of Weight by Height, with
one line for males and one for females): different intercepts may signify a
significant effect of the categorical variable, different slopes may signify a
significant interaction effect with the continuous variable.
12. Error,
Residual
The amount by which an observed variate differs from the value predicted by
the model. Errors or residuals are the segments of scores not accounted for by
the analysis. In Analysis of Variance, the errors are assumed to be independent
of each other, and normally distributed about the sample means. They are also
assumed to be identically distributed for each sample (since the analysis is
testing only for a difference between means in the sampled population), which
is known as the assumption of homogeneity of variances.
13. Normal distribution A bell-shaped frequency distribution of a continuous variable. The formula for
the normal distribution contains two parameters: the mean, giving its location,
and the standard deviation, giving the shape of the symmetrical ‘bell’. This
distribution arises commonly in nature when myriad independent forces,
themselves subject to variation, combine additively to produce a central
tendency. The technique of Analysis of Variance is constructed on the
assumption that the component of random variation takes a normal
distribution. This is because the sums of squares that are used to describe
variance in an ANOVA accurately reflect the true variation between and
within samples only if the residuals are normally distributed about sample
means.
14. Degrees of freedom,
d.f.
The number of pieces of information that we have on a response, minus the
number needed to calculate its variation. The F-ratio in an Analysis of
Variance is always presented with two sets of degrees of freedom, the first
corresponding to one less than the a samples or levels of the explanatory
variable (a - 1), and the second to the remaining error degrees of freedom (n -
a). For example, a one-way ANOVA may find an effect of nationality on body
weight (F3,23 = 3.10, p < 0.05) in a test of four nations (giving the 3 test
degrees of freedom) sampled with 27 subjects (giving the 23 error degrees of
freedom). A continuous factor has one degree of freedom, so the linear
regression ANOVA has 1 and n-2 degrees of freedom (e.g. a height effect on
body weight: F1,25 = 4.27, p < 0.05, from 27 subjects).
15. F-statistic,
F-ratio
The statistic calculated by Analysis of Variance, which reveals the
significance of the hypothesis that Y depends on X. It comprises the ratio of
two mean-squares: MS[X] / MS[ε]. The mean-square, MS, is the average sum
of squares: the sum of squared deviations (as defined above) divided by the
appropriate degrees of freedom. This is why the F-ratio is always presented
with two degrees of freedom, one used to create the numerator MS[X], and
one the denominator, MS[ε]. The F-ratio tells us precisely how much more of
the total variation in Y is explained by X (MS[X]) than is due to random,
unexplained, variation (MS[ε]). A large ratio indicates a
significant effect of X. In fact, the observed F-ratio is connected by a very
complicated equation to the exact probability of a true null hypothesis, i.e. that
the ratio equals unity, but you can use standard tables to find out whether the
observed F-ratio indicates <5% probability of making a mistake in rejecting a
true null hypothesis.
16. Significance,
p
This is the probability of mistakenly rejecting a null hypothesis that is actually
true. In the biological sciences a critical value α = 0.05 is generally taken as
marking an acceptable boundary of significance. A large F-ratio signifies a
small probability that the null hypothesis is true. Thus detection of a
nationality effect: F3,23 = 3.10, p < 0.05 means that the variation in weight
between the samples from four nations is 3.10 times greater than the variation
within samples, making these data incompatible with a null hypothesis of
nationality having no effect on weight. The height effect detected in the linear
regression (F1,25 = 4.27, p < 0.05) means that the distribution of data is
incompatible with height having no influence on weight in the sampled
population. This regression line takes the form Y = a + bX, and 95%
confidence intervals for the estimated slope b are obtained as b ± t0.05[n−2]·SE(b); if
the slope is significant, then these intervals will not encompass zero.
Appendix 2: Self-test questions on Analysis of Variance
C. P. Doncaster 49
APPENDIX 2: SELF-TEST QUESTIONS ON ANALYSIS OF VARIANCE
1. Write down the formula for calculating the variance of a sample of scores (use Yi to denote a
score for each of n subjects). Explain in words what is meant by this expression.
2. Write down the formula for the standard error of the mean. Explain in words what is meant by
this expression. Why does it get smaller as n increases?
3. A sample of 8 male blackbirds are tested for response times to an alarm signal, and this is
compared to responses of a sample of 9 females. The Analysis of Variance gives a value of F =
4.56. Use tables of critical values of F to decide whether mean responses differ between males
and females. The problem could also have been answered with a t-test, in which case the test
would have produced a value of t = 2.135, which is the square root of 4.56. For both tests,
critical values are looked up in tables using the same error degrees of freedom. Look up the
critical value of t at α = 0.05 and then square it. Check that this corresponds with the equivalent
critical value of F. This shows you that an ANOVA on two samples is equivalent to a t test.
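The check described in question 3 can be run in R with the quantile functions qt() and qf():

```r
# Two samples of 8 and 9 give error degrees of freedom 8 + 9 - 2 = 15.
qt(0.975, df = 15)^2         # squared two-tailed critical t at alpha = 0.05: 4.54
qf(0.95, df1 = 1, df2 = 15)  # critical F with 1 and 15 df: the same value, 4.54
sqrt(4.56)                   # the observed t: 2.135
```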
4. State the model for the above Analysis of Variance. If we increased the sample sizes to 12 of
each sex and added a third sample of 12 neutered males, what would be the degrees of freedom
for the Analysis of Variance?
5. If we divided each of the samples into three groups, of 4 chicks, 4 juveniles, and 4 adults, we
could then test the alarm response against two independent effects: SEX and AGE. Write out
the full model and give the degrees of freedom for each term.
6. If SEX and AGE main effects were significant, but the SEX:AGE interaction was not, sketch
out how the interaction plot might look. Sketch another plot showing how it might look if the
interaction effect was also significant.
7. How would you interpret the outcome of the experiment if the interaction effect was
significant?
8. As part of your research project, you want to find out how root growth of lawn grasses is
influenced by frequency of mowing under different conditions of watering. You decide to use
urban gardens as sources of independent grass plots, purloining the services of willing
householders to provide different mowing and watering regimes. Describe how you would
design your methods so that the data could be analysed with a two-way Analysis of Variance.
(Hint: think how you want the data to look in a design matrix of the sort we have been using in
previous examples - requires thinking through carefully!).
9. Interpret the following output from a statistics package
The regression equation is
Log(survival) = 11.4 + 11.6 Temperature
Predictor Coef StDev T P
Constant 11.417 5.309 2.15 0.047
Temperat 11.6115 0.9931 11.69 0.000
S = 2.953 R-Sq = 89.5% R-Sq(adj) = 88.9%
Analysis of Variance
Source DF SS MS F P
Regression 1 1192.1 1192.1 136.71 0.000
Residual Error 16 139.5 8.7
Total 17 1331.6
Appendix 3: Worked examples of Analysis of Variance
C. P. Doncaster 51
APPENDIX 3: SOURCES OF WORKED EXAMPLES IN ANALYSIS OF
VARIANCE
1. One-way Analysis of Variance
Fowler, J. & Cohen, L. 1998. Practical Statistics for Field Biology. John Wiley. Chapter 17.
Section 17.3 (p. 181)
Samuels, M.L. 1991. Statistics for the Life Sciences. Maxwell Macmillan. Chapter 12.
Example 12.1-12.9 (p. 390-406)
Exercises 12.1-12.14 (with answers at back of book)
Sokal, R.R. & Rohlf, F.J. 1995. Biometry, 3rd Edition. Freeman. Chapters 8 and 9.
Table 8.1 (p. 181) and Table 8.5
Table 8.3 (p. 192) and Table 8.6
Box 9.1 (p. 210) - unequal sample sizes
Box 9.4 (p. 218) - equal sample sizes
Zar, J.H. 1984. Biostatistical Analysis, 2nd Edition. Prentice-Hall. Chapter 11.
Example 11.1 (p. 164)
2. Two-way Analysis of Variance
Fowler, J. & Cohen, L. 1998. Practical Statistics for Field Biology. John Wiley. Chapter 17.
Section 17.6 (p. 190)
Sokal, R.R. & Rohlf, F.J. 1995. Biometry, 3rd Edition. Freeman. Chapter 11.
Box 11.1 (p. 324) - cross factored analysis
Table 11.1 (p. 327) - meaning of interaction: equivalent to Fig. 1.7 in your ANOVA notes.
Box 11.2 (p. 332)
Zar, J.H. 1984. Biostatistical Analysis, 2nd Edition. Prentice-Hall. Chapter 13.
Example 13.1 (p. 207)
Appendix 4: Procedural steps for ANOVA
C. P. Doncaster 53
APPENDIX 4: SUMMARY OF PROCEDURAL STEPS FOR ANALYSIS OF
VARIANCE
OBSERVATIONS
    ↓
PLOT
    ↓
DIAGNOSTICS: random · independent · normal · homogeneous · linear
    ↓
Assumptions met?
    NO → TRANSFORM, then return to PLOT and DIAGNOSTICS
    YES ↓
ANALYSIS OF VARIANCE
F#,# = ##.##, P < 0.0#
    ↓
INTERPRETATION
    • Higher-order interactions first
    • Equation and r² for regression
    • Pearson's r for correlation
Appendix 5: Self-test questions on regression and correlation
C. P. Doncaster 55
APPENDIX 5: SELF-TEST QUESTIONS ON ANALYSIS OF VARIANCE (2)
1. A colleague tells you he has data on the activity of three daphnia at each of six levels of
pH, and he needs advice on analysis.
a) What extra information do you need to know before you can advise on doing any
statistical tests at all?
b) If you are satisfied that statistical analysis is appropriate, are these data suitable for
Analysis of variance, and/or regression, and/or correlation? Should it be parametric
or non-parametric?
c) Significance would be tested with how many degrees of freedom?
2. You have three samples of wheat grains, one of which comes from genetically modified
parent plants, one from organic farming, and the third from conventional farming. You
want to find out if these different practices make a difference to the weight of seeds. What
are your options for analysis?
a) Regression.
b) Chi-squared test on the frequencies in different weight categories.
c) Kruskal-Wallis test on the three samples.
d) Analysis of variance on the three samples.
e) Student’s t-tests on each combination of pairs to find out how their averages differ
from each other.
3. You have a packet of wild-type tomato seeds and a packet of genetically modified tomato
seeds, and you want to know whether they give different crop yields under a conventional
growing regime and under an ‘organic’ regime. How do you find out?
4. What, if anything, is wrong with each of these reports?
a) “The data is plotted in graph 2, and it shows a significant change with temperature
(F1 = 23.71625, P = 0.000).”
b) “Figure 2 shows that temperature has a strong positive influence on activity across
this range (r2 = 0.78, F1,10 = 23.72, P < 0.001).”
c) “There is a strong negative correlation but the results are not significant (r = -0.64, P
= 0.06).”
d) “No correlation could be established from the nine observations (Pearson’s
coefficient r = -0.64, d.f. = 7, P > 0.05).”
5. Interpret the following command and output from an analysis in R:
> summary(aov(Y ~ A*B))
Df Sum Sq Mean Sq F value Pr(>F)
A 2 0.61 0.30 1.393 0.3184
B 1 0.97 0.97 4.465 0.0791 .
A:B 2 136.07 68.03 312.974 8.56e-07 ***
Residuals 6 1.30 0.22
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
Appendix 6: Worked examples of regression
C. P. Doncaster 57
APPENDIX 6: SOURCES OF WORKED EXAMPLES ON REGRESSION
AND CORRELATION
Doncaster, C.P. & Davey, A.J.H. 2007. Analysis of Variance and Covariance: How to Choose
and Construct Models for the Life Sciences. Cambridge University Press.
- Pages 46-57.
- See the book’s web pages for:
Worked examples of all Analysis of Variance models:
http://www.southampton.ac.uk/~cpd/anovas/datasets/
Commands for analysing them in R:
http://www.southampton.ac.uk/~cpd/anovas/datasets/ANOVA in R.htm
Fowler, J. et al. 1998. Practical Statistics for Field Biology. John Wiley.
- Chapters 14-15.
- Section 14.5 (p. 135)
- Section 15.6 (p. 147)
- Sections 15.12 to 15.15 (p. 156)
Samuels, M.L. 1991. Statistics for the Life Sciences. Maxwell Macmillan.
- Chapter 13.
- Numerous examples throughout this chapter, and exercises (pp. 449, 463, 474, 484 and 493,
with answers at back of book)
Sokal, R.R. & Rohlf, F.J. 1995. Biometry, 3rd Edition. Freeman.
- Chapters 14-15.
- Table 14.1 (p. 459)
- Box 14.1 (p. 465)
Zar, J.H. 1984. Biostatistical Analysis, 2nd Edition. Prentice-Hall.
- Chapters 17, 19.
- Examples 17.1 (p. 262), and 17.9 (p. 286).
- Examples 19.1 (p. 308)
Further reference information on statistical modelling with ANOVA and regression can be found
in the Lexicon of Statistical Modelling at: http://www.geodata.soton.ac.uk/biology/lexstats.html.
Appendix 7: Critical values of the F-distribution
C. P. Doncaster 59
APPENDIX 7: CRITICAL VALUES OF THE F-DISTRIBUTION
v1 is the degrees of freedom of the numerator means squares;
v2 is the degrees of freedom of the denominator means squares.
Note that the power of Analysis of Variance to detect differences can be increased if the total
number of variates is divided into more samples. For example:
(i) 2 samples with 9 variates in each, so n = 18, has critical F1,16 = 4.49
(ii) 3 samples with 6 variates in each, so n = 18, has critical F2,15 = 3.68
(iii) 6 samples with 3 variates in each, so n = 18, has critical F5,12 = 3.11
All three tests require collecting the same amount of data. The first one can only detect a
difference in the sampled population if the variance between samples is more than four times
greater than the variance within samples. The third one, in contrast, can detect a difference from a
between-sample variance little more than three times greater than the within-sample variance.
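The critical values quoted above can be verified in R with the F quantile function qf():

```r
qf(0.95, df1 = 1, df2 = 16)  # critical F for 2 samples of 9:  4.49
qf(0.95, df1 = 2, df2 = 15)  # critical F for 3 samples of 6:  3.68
qf(0.95, df1 = 5, df2 = 12)  # critical F for 6 samples of 3:  3.11
```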