
ANALYSIS OF VARIANCE AND

MODEL FITTING FOR R

C. Patrick Doncaster

http://www.soton.ac.uk/~cpd/

$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$


CONTENTS Page

Lecture: One-Way Analysis of Variance .......................................................................... 1

Comparison of parametric and non-parametric methods of analysing variance

What is parametric one-way Analysis of Variance (ANOVA)?

How to do a parametric one way Analysis of Variance

What are degrees of freedom?

Assumptions of parametric Analysis of Variance

Summary of parameters for estimating the population mean

Practical: Calculating One-Way Analysis of Variance ..................................................... 9

Lecture: Two-Way Analysis of Variance ........................................................................ 15

Example of two-way Analysis of Variance: cross-factored design

Using a statistical model to define the test hypothesis

Degrees of freedom

How to do a two-way Analysis of Variance

Using interaction plots

Lecture: Regression .......................................................................................................... 23

Comparison of Analysis of Variance and regression models

Degrees of freedom for regression

Calculation of the slope and intercept of the regression line

Practical: Two-Way Analysis of Variance in R ............................................................... 29

Lecture: Correlation and Transformations .................................................................... 31

The difference between correlation and regression, and testing for correlation

Transforming data to meet the assumptions of parametric Analysis of Variance

Lecture: Fitting Statistical Models to Data ..................................................................... 37

The three principal types of data and statistical models

1. One sample, one variable: G-test of goodness-of-fit

2. One sample, two variables:

(a) Categorical variables: G-test of contingency table

(b) Continuous variables: regression or correlation

3. One-way classification of two or more samples: Analysis of Variance

Supplementary information: Selecting and fitting models

1. One-way classification with two continuous variables: multiple regression

2. Two-way classification of samples: two-factor ANOVA or General Linear Model

Practical: Calculating Regression and Correlation ......................................................... 43

Appendix 1: Terminology of Analysis of Variance .............................................................. 45

Appendix 2: Self-test questions (1)......................................................................................... 49

Appendix 3: Sources of worked examples - ANOVA ........................................................... 51

Appendix 4: Procedural steps for Analysis of Variance ...................................................... 53

Appendix 5: Self-test questions (2)......................................................................................... 55

Appendix 6: Sources of worked examples - Regression ....................................................... 57

Appendix 7: Table of critical values of the F-distribution .................................................. 59


LECTURE: ONE-WAY ANALYSIS OF VARIANCE

This booklet covers five lectures and three practicals. It is designed to help you:

1. Understand the principles and practice of Analysis of Variance, regression and correlation;

2. Appreciate their underlying assumptions, and how to meet them;

3. Learn the basics of using statistical models for quantitative solutions.

In meeting these objectives you will also become more familiar with the terminology of

parametric statistics, and this should help you use statistical packages and interpret their output,

and better understand published analyses.

Comparison of parametric and non-parametric methods

You have already been introduced to non-parametric tests earlier in this course. These are useful

because they tend to be robust - they give you a rough but reliable estimate and work well on data

which have an unknown underlying distribution. But often we can be confident about underlying

distributions, and then parametric statistics begin to show their strengths.

Some limitations of non-parametric statistics:

1. They test hypotheses, but do not always give estimates for parameters of interest;

2. They cannot test two-way interactions, or categorical combined with continuous effects;

3. They each work in different ways, with their own quirks and foibles and no grand scheme;

4. In situations of even moderate complexity such as you may encounter when doing research

projects, there may be no non-parametric statistic readily available.

Some advantages of parametric statistics:

1. They can be more powerful because they make use of actual data rather than ranks;

2. Parametric tests are very flexible, coping well with incomplete data and correlated effects;

3. They can test two-way interactions, and also categorical combined with continuous effects;

4. They are all built around a single theme, of Analysis of Variance. So there is a grand scheme,

a single framework for understanding and using them.

What is Analysis of Variance (ANOVA)?

Analysis of Variance is an extension of the Student’s t-test that you will already be familiar with.

A t-test can look for differences between the mean scores in two samples (e.g. body weights of

males and females). A one-way Analysis of Variance can look for an overall difference between

the mean scores in 2 or more samples of a factor (e.g. crop yield under three different treatments

of fertiliser). Later we will see how a two-way Analysis of Variance can further partition the

variance among two factors (e.g. crop yield under different combinations of pesticide as well as

fertiliser).

What does Analysis of Variance do? It analyses samples to test for evidence of a difference

between means in the sampled population. It does this by measuring the variation in a continuous

response variable (e.g. weight, yield etc) in terms of its sum of squared deviations from the

sample means. It then partitions this variation into explained and unexplained (residual)

components. Finally it compares these partitions to ask how many times more variation is

explained by differences between samples than by differences within samples.

Most ways of measuring variation would not allow partitioning, because the variation in the

components would not add up to the variation in the whole. We use ‘sums of squares’ because

they do have this property. We get the explained component of variation from the sum of squared


deviations of sample means from the global mean. Then we get the unexplained component of

variation from the sum of squared deviations of variates from their sample means. These two

components together account for the total variation, which can be obtained from the sum of

squared deviations of variates from the global mean.

Let’s see how it works in practice. Say we have sampled a woodland population of wood mice,

and found the average weight of adult males is 25 g, and the average of adult females (not

gestating) is 17 g. But both sexes vary quite widely around these means, and some males are

lighter than some females. We want to know whether our samples just reflect random variation

within an undifferentiated population, or whether they illustrate a real difference in weight by

sex.

The problem is illustrated below with an ‘interval plot’ produced by R. It shows male and female

means and their 95% confidence intervals. This is a common way of summarising averages of a

continuous variable. The vertical lines cover the range of possible values for each population

mean, with 95% confidence. You will see how they are derived in the practical, but we use them

here to illustrate the extent of variation within each sample.
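If you want to see how such a plot is built, the sketch below draws means with 1.96 × SE intervals in base R. The weights are hypothetical values chosen only for illustration; they are not the data behind the plot described here.

weight <- c(22, 27, 24, 29, 23,  15, 19, 16, 21, 14)    # hypothetical weights (g)
sex    <- factor(rep(c("Male", "Female"), each = 5))
means  <- tapply(weight, sex, mean)                     # sample mean for each sex
SEs    <- tapply(weight, sex, sd) / sqrt(tapply(weight, sex, length))
ci     <- 1.96 * SEs                                    # 95% half-widths, as used later in these notes
plot(1:2, means, xlim = c(0.5, 2.5), ylim = range(means - ci, means + ci),
     xaxt = "n", xlab = "Sex", ylab = "Weight (g)")     # plot the two means
axis(1, at = 1:2, labels = levels(sex))                 # label the x-axis by sex
arrows(1:2, means - ci, 1:2, means + ci,                # add the interval bars
       angle = 90, code = 3, length = 0.1)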

The confidence intervals overlap, reflecting the fact that some females were heavier than some

males. We do an Analysis of Variance to test whether the sexes are really likely to differ from

each other on average in the population, despite this overlap in the samples. This involves

comparing the two sources of variation in weight: (i) the average variation between means for

each sex (this is the variation explained by the factor ‘Sex’), and (ii) the average variation around

each sample mean (this is the residual, unexplained variation). Together they add up to the total

variation, when variation is measured as squared deviations from means.

Box 1. Partitioning the sums of squares (supplementary information)

Why do explained and unexplained sources of variation add up to the total variation, when

variation is measured as squared deviations from means?

For any one score, Y - Ḡ is its deviation from the grand mean. If we measure variation as squared deviations, then the total variation in our two samples is the sum of squares: Σ(Y - Ḡ)².

However, each Y - Ḡ comprises two components: Y - Ȳ is the deviation of the score from the mean for its sample i and therefore the component not explained by the factor ‘sex’, while Ȳ - Ḡ is the deviation of the sample mean from the grand mean and therefore the explained component. For example, a score of 28 g for a particular male is 3 g away from his sample mean Ȳ = 25 g, which compares to the deviation of 4 g by which the sample mean differs from the global mean Ḡ = 21 g (i.e. the mean of the means for each sex: (25+17)/2).

We can use a vector to describe the deviation of each score in terms of the two independent

sources of variation (explained and unexplained).

We plot these deviations of any one of the scores to its sample mean Ȳ, on an axis perpendicular to the one describing the deviation of the global mean Ḡ from the sample mean Ȳ. This is

because these two deviations are independent by definition: the horizontal component in the

graph is explained by the factor sex, and the vertical component is unexplained, residual

deviation.

The total deviation is then the resultant vector, i.e. the bold arrow in the graph below resulting

from the combination of these two independent sources of variation.

[Figure: vector diagram. The horizontal arrow is the explained component (Ȳ - Ḡ, labelled ‘Response’), the vertical arrow is the unexplained component (Y - Ȳ, labelled ‘Error’), and the bold resultant arrow is the total deviation Y - Ḡ.]

The squared length of this vector equals the sum of the squares of the other two sides (vertical and horizontal arrows: Pythagoras’s theorem). So if we represent variation as squared deviations, the variation for each score partitions into the two independent sources: the explained (Ȳ - Ḡ)², and the unexplained (Y - Ȳ)². We could attach such vectors to all our scores, and the sum of all these increments then gives the total squared deviations in terms of the explained variation added to the unexplained variation: Σ(Y - Ḡ)² = Σ(Ȳ - Ḡ)² + Σ(Y - Ȳ)².

If the average squared deviation of Ḡ from Ȳ is big compared to the average squared deviation of Y from Ȳ, then we could conclude that most of the total variation is explained by differences between the sample means. This is exactly the procedure adopted by Analysis of Variance.
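The partition can be checked numerically in R. This is a minimal sketch using hypothetical weights (chosen only to illustrate the identity, not the data analysed here):

weight <- c(28, 23, 26, 24, 22,  18, 15, 19, 16, 17)    # hypothetical weights (g)
sex    <- factor(rep(c("Male", "Female"), each = 5))
G    <- mean(weight)                       # grand mean
Ybar <- ave(weight, sex)                   # each score's own sample mean
SS.total     <- sum((weight - G)^2)        # total squared deviations
SS.explained <- sum((Ybar - G)^2)          # explained: (Ybar - G)^2 counted once per score
SS.error     <- sum((weight - Ybar)^2)     # unexplained: deviations from sample means
all.equal(SS.total, SS.explained + SS.error)   # TRUE: the partition holds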

How to do a one-way Analysis of Variance

Let’s do this very simple Analysis of Variance on the two samples of adult wood mice. We want

to know if there is any difference between the body weights of males and females that cannot be

attributed to sampling error.

Design: Firstly it is very important to have designed a method of data collection that will allow a

sample to represent the population that we are interested in. Whatever the method, it must allow

subjects to be picked at random from the population. So if our male sample is going to comprise

5 individuals, they should not all be brothers, or all taken from the same patch of wood. [In the

practical you will look at an experimental analysis, of the effect of different pesticides on

hoverflies; you will then have experimental plots in place of individuals, and the important


design consideration will be to allocate the different treatments (of pesticide) at random to the

experimental plots.]

Analysis: Having collected our samples, we then weigh all the males and all the females, and

calculate mean weights for each sample, and a grand (i.e. total or pooled) mean weight. These

data have been put into a spreadsheet, which is shown in Fig. 2 below. They will allow us to test

the null hypothesis, H0: There is no difference between the sample means.

Fig. 2. Data on body weights of male and female wood mice, as they look in an Excel spreadsheet.

Each score can now be tagged with the following information:

1. Its sample mean (column D);

2. The grand mean (col E);

3. The squared deviation of the sample mean from the grand mean (col F), which equals the

component of variation for this score that is explained by the independent variable ‘sex’;

4. The squared deviation of the score from the sample mean (col G), which equals the

component of unexplained variation for this score;

5. The squared deviation of the score from the grand mean (col H), which equals the component

of total variation.

Columns F, G, and H are then summed to find their ‘Sums of Squares’, which define the

variation from explained and unexplained sources, and the total variation:

We are interested in comparing the average explained variation with the average unexplained

(error) variation, and we get these averages from the ‘Mean Squares’:

These Mean Squares measure the explained and unexplained variances in terms of variability per

degree of freedom. Finally, the F-statistic is obtained from the ratio of these two Mean Squares:
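Written out, with a samples, n scores in total, sample means Ȳ_i and grand mean Ḡ, these quantities take the standard form:

$$SS_{group} = \sum(\bar{Y}_i - \bar{G})^2, \qquad SS_{error} = \sum(Y - \bar{Y}_i)^2, \qquad SS_{total} = \sum(Y - \bar{G})^2$$

$$MS_{group} = \frac{SS_{group}}{a-1}, \qquad MS_{error} = \frac{SS_{error}}{n-a}, \qquad F = \frac{MS_{group}}{MS_{error}}$$

where each sum runs over all n scores, so the explained term counts (Ȳ_i - Ḡ)² once for every score in sample i, as in column F of the spreadsheet.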


Interpretation: The F statistic is the ratio of average explained variation to average unexplained

variation, and a large ratio indicates that differences between the sample means account for much

of the variation of scores from the grand mean score. We can look up a level of significance in

tables of the F-statistic. In this example, for 1 and 8 degrees of freedom, the critical 5% value is

5.32. Since our calculated value exceeds this, we can draw the following conclusion: “Body

weights differ between males and females in the sampled population (F1,8 = 7.27, p < 0.05)”.

This is the standard way to present results of Analysis of Variance. Whenever presenting

statistical results, always give the degrees of freedom that were available for the test, so the

reader can know how big your samples were. For any Analysis of Variance this means giving

two sets of degrees of freedom.

What are degrees of freedom?

General rule: The F-ratio in an Analysis of Variance is always presented with two sets of degrees

of freedom. In a one-way test, the first corresponds to one less than the a samples or levels of the

explanatory variable (a - 1), and the second to the remaining error degrees of freedom (n - a).

For both sets, the degrees of freedom equals the number of bits of information that we have,

minus the number that we need in order to calculate variation. Think of degrees of freedom (d.f.)

as the numbers of pieces of information about the ‘noise’ from which an investigator wishes to

extract the ‘signal’. If you want to draw a straight line to represent a scatter of n points, you need

two pieces of information: slope and intercept, in order to define the line (i.e. you need n ≥ 2); the scatter about the line (are all the points on it, or are they scattered or curved from it?) can then be measured with the remaining n - 2 degrees of freedom. This is why the significance of a regression is tested with a Student’s t with n - 2 d.f. Likewise, when looking for a difference

between two samples, a Student’s t is tested with n - 2 d.f. because one d.f. is required to fix each

of the two sample means.

In Analysis of Variance, the first set of degrees of freedom refers to the explained component of

variation. This takes size a – 1, because we have a sample means and we need 1 grand mean to

calculate variation between these means. The second set of degrees of freedom refers to the

unexplained (error) variation. This takes size n – a, because we have n data points and we need a

sample means to calculate variation within samples.

Thus we calculate the average variance of sample means around the grand mean from the sum of

squared deviations of Ȳ from Ḡ, divided by one less than the a samples (= 1 for the wood mice). Then we can deduce the average error variance from the sum of squared deviations of Y from Ȳ, divided by the remaining n - a degrees of freedom (= 8 in the wood mouse example).

Degrees of freedom are very important because they tell us how powerful our test is going to be.

Look at the table provided of critical values of the F-distribution (p. 59). With few error d.f. (the rows), the error variation needs to be many times smaller than the variation between groups before the ratio of the group MS to the error MS is big enough that we can be confident of a difference between

groups in the population from which we took samples for analysis.

This is particularly true when comparing between few samples. For example, if we want to

compare two samples each of 3 subjects, then the two sample means take 2 pieces of information

from the 6 subjects, leaving us with 4 error d.f. A significant difference at P < 0.05 then requires

that the average variation between samples is more than 7.71 times greater than the average

residual variation within each sample (as opposed to > 5.32 for the 2 samples of wood mice each

with 5 subjects: Appendix 7).
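If you have R to hand, these critical values can also be obtained from the qf function for quantiles of the F-distribution, rather than read from Appendix 7:

qf(0.95, df1 = 1, df2 = 4)   # critical 5% value of F with 1 and 4 d.f.: 7.71
qf(0.95, df1 = 1, df2 = 8)   # critical 5% value of F with 1 and 8 d.f.: 5.32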


Assumptions of Analysis of Variance:

The Analysis of Variance is run on samples taken from a population of interest, which means it

must assume: random sampling, independent residuals, normally distributed residuals, and

homogeneous variances. We examine these 4 assumptions with a real example in the practical.

1. Random sampling is a design consideration for all parametric and non-parametric analyses. If

we had some a priori reason for wanting male mice to be heavier on average than females,

perhaps to bolster a favoured theory, then we might be tempted to choose larger males as

‘representatives’ of the male population. Clearly this is cheating, and only bolsters a circular

argument. Random sampling avoids this problem.

2. Independence is the assumption that the residuals (or ‘errors’, the deviations of scores from their sample means) should be independently distributed around sample means. In other

words, knowing how much one score deviates from its sample mean should not reveal anything

about how others do. Statistics only work by the accumulation of pieces of evidence about the

population, no one of which is convincing in itself. In combining these increments it is

obviously important to know that they are independent, and you are not repeatedly drawing on

the same information in different guises. This is true for both parametric and non-parametric

tests, and it is one of the biggest problems in statistical analysis for biologists.

If the wood mouse data came from sampling a wild population, some individuals may be caught

several times (if they get released back into the population after weighing). But clearly 5

measures repeated on the same individual do not provide the same amount of information as one

measure on each of 5 different individuals. This problem is called ‘pseudo-replication’ and leads

to the degrees of freedom being unjustly inflated. Analysis of variance can be conducted on

repeated measures, but it requires declaring ‘Individual’ as a second factor, and this adds extra

complications and assumptions - avoid it if at all possible!

Equally if most males came from one locality and most females from another, then we may be

seeing habitat differences not sex differences (i.e. the weights within each sample are not

independent, but depend on habitat). This problem is referred to as the ‘confounding’ of two

factors because their effects cannot be separated.

3. Homogeneity of variances is the assumption that all samples have the same variation about

their means, so the analysis can pertain just to finding differences between means. Violation of

this assumption is likely to obscure true differences. It can often be met by transforming the data

(see section on statistical modelling). See the practical exercise on page 14 for the R command to

perform a Bartlett’s test of homogeneity of variances.

4. Normality is the assumption that the residuals are normally distributed about their sample

means. We have seen how Analysis of Variance only makes use of two parameters to describe

each sample: the mean and the average squared deviations (the variance). A normal distribution

is a symmetrical distribution of frequencies defined by just these two parameters, so if the scores

are normally distributed around their sample means, then the data will be adequately represented

in the Analysis of Variance test. But if the distribution of scores is skewed, or bounded within

fixed limits (e.g. body weights can extend upwards any amount but cannot fall below zero), then

the mean may not represent the true central tendency in the data, and the squared deviations may

be an unreliable indicator of variance. In such cases, it is often necessary to transform the data

first (see pp. 34-35). See the practical exercise on page 14 for the R command to perform a

Shapiro-Wilk normality test on the residuals.

When using any statistic (parametric or non-parametric), you should do visual diagnostic tests to

check its assumptions. This applies also to Analysis of Variance, and in R you can do it with a

command of the sort: plot(aov(y ~ x)).


Summary of parameters for estimating the population mean

Whenever you collect a sample of measurements, you will want to summarise its defining characteristics. If the data are approximately normally distributed around

some central tendency, and many types of biological data are, then three parametric statistics can provide much of the essential information. The sample mean, Ȳ, tells you what the average measurement from your sample is; the standard deviation (SD) tells you how much variation there is in the data around the sample mean; the standard error (SE) indicates the uncertainty associated with viewing the sample mean as an estimate of the mean of the whole population, μ.

1. Variable
Description: A property that varies in a measurable way between subjects in a sample.
Example: Weight of seeds of the Princess Bean Phaseolus vulgaris (in: Samuels, M.L. 1991. Statistics for the Life Sciences. Macmillan).

2. Sample
Description: A collection of individual observations selected by a specified procedure. In most cases the sample size is given by the number of subjects (i.e. each is measured once only).
Example: A sample of 25 Princess Bean seeds, selected at random from the total production of an arable field. Weights (mg): 343, 755, 431, 480, 516, 469, 694, 659, 441, 562, 597, 502, 612, 549, 348, 469, 545, 728, 416, 536, 581, 433, 583, 570, 334.

3. Sample mean, Ȳ
Description: The sum of all observations in the sample, divided by the size of the sample, n. The sample mean is an estimate of the population mean, μ (‘mu’), which is one of two parameters defining the normal distribution (the other is σ, see below).
Example: The sample mean Ȳ = ΣY_i / n = 526.1 mg. This comes from a population, the total production of the field, which follows a normal distribution and has a mean μ = 500 mg.

4. Sum of squares, SS
Description: The squared distance between each data point (Y_i) and the sample mean, summed for all n data points.
Example: The sample sum of squares SS = Σ(Y_i - Ȳ)².

5. Variance, s²
Description: The variance in a normally distributed population is described by the average of n squared deviations from the mean. Variance usually refers to a sample, however, in which case it is calculated as the sum of squares divided by n - 1 rather than n.
Example: The sample variance s² = SS / (n - 1).

6. Sample standard deviation, SD or s
Description: Describes the dispersion of data about the mean. It is equal to the square root of the variance. For a large sample size, Ȳ ≈ μ, and the standard deviation of the sample approaches the population standard deviation, σ (‘sigma’). It is then a property of the normal distribution that 95% of observations will lie within 1.960 standard deviations of the mean, and 99% within 2.576.
Example: The sample standard deviation s = √(variance) = 113.7 mg. The standard deviation of the population from which the sample was drawn is σ = 120 mg.


7. Normal distribution
Description: A bell-shaped frequency distribution of a continuous variable. The formula for the normal distribution contains two parameters: the mean, giving its location, and the standard deviation, giving the shape of the symmetrical ‘bell’. This distribution arises commonly in nature when myriad independent forces, themselves subject to variation, combine additively to produce a central tendency. Many parametric statistics are based on the normal distribution because of this, and also its property of describing both the location (mean) and dispersion (standard deviation) of the data. Since dispersion is measured in squared deviations from the mean, it can be partitioned between sources, permitting the testing of statistical models.
Example: The weights of Princess Bean seeds in the population follow a normal distribution (shown in the graph, with frequency on the horizontal axis). Some 95% of the seeds are within 1.96 standard deviations of the mean, which is μ ± 1.96σ = 500 ± 235 mg.

8. Standard error of the mean, SE
Description: Describes the uncertainty, due to sampling error, in the mean of the data. It is calculated by dividing the standard deviation by the square root of the sample size (SD/√n), and so it gets smaller as the sample size gets bigger. In other words, with a very large n, the sample mean approaches the population mean. If random samples of n measurements were taken from any population (not necessarily normal) with mean μ and standard deviation σ, the mean of the sampling distribution of Ȳ would equal the population mean μ. Moreover, the standard deviation of sample means around the population mean would be given by σ/√n.
Example: The standard error of the mean SE = SD/√n = 113.7/√25 = 22.74 mg.

9. Confidence interval for μ
Description: Regardless of the underlying distribution of data, the sample means from repeated random samples of size n would have a distribution that approached normal for large n, with 95% of sample means at μ ± 1.960 σ/√n. With only one sample mean Ȳ and standard error SE, these can nevertheless be taken as best estimates of the parametric mean and standard deviation of sample means. It is then possible to compute 95% confidence limits for μ at Ȳ ± 1.960 SE (for large sample sizes). For small sample sizes, the 95% confidence limits for μ are computed at Ȳ ± t0.05[n-1] SE.
Example: The 95% confidence intervals for μ from the sample of 25 Princess Bean seeds are at Ȳ ± t0.05[24] SE. The sample is thus representative of the population mean, which we happen to know is 500 mg. If we did not know this, the sample would nevertheless lead us to accept a null hypothesis that the population mean lies anywhere between 479.05 and 573.15 mg.
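As a cross-check on the values quoted above, the same summary statistics can be computed in R. This is a minimal sketch using the 25 seed weights listed in the table; the confidence limits use the t-quantile for 24 d.f., so they should come out close to the limits quoted above.

weight <- c(343, 755, 431, 480, 516, 469, 694, 659, 441, 562, 597, 502, 612,
            549, 348, 469, 545, 728, 416, 536, 581, 433, 583, 570, 334)
n  <- length(weight)                        # sample size, 25 seeds
Y  <- mean(weight)                          # sample mean, ~526.1 mg
s  <- sd(weight)                            # sample standard deviation, ~113.7 mg
SE <- s / sqrt(n)                           # standard error of the mean, ~22.7 mg
Y + c(-1, 1) * qt(0.975, df = n - 1) * SE   # 95% confidence limits for the population mean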


PRACTICAL : CALCULATING ONE-WAY ANALYSIS OF VARIANCE

Rationale

Analysis of variance is one of the most commonly used tests in biology, because biologists often

want to look for differences in mean responses between groups. Do male and female shrews

differ in body weight? Does crop yield differ with different concentrations of a fertiliser? Does

crop yield vary with rainfall? To find out whether shrews from a population of interest differ in

size between the sexes you could perform a t-test on samples from the population. This is a

simplified type of Analysis of Variance suitable for just two samples (males and females), and it

gives exactly the same statistical prediction. The Analysis of Variance comes into its own when

you are seeking differences between more than two samples. You would use Analysis of

Variance to find out if crop yield differs with three or more different concentrations of fertiliser.

You would also use the same method of Analysis of Variance to test the effect on crop yield of a

continuous variable such as rainfall, in which case you are testing whether rainfall has a linear

effect on yield (from a single sample rather than comparing between two or more samples).

In this practical you will perform an Analysis of Variance by hand, in order to see how it works.

This practical is designed to help you to interpret the output from statistical packages such as R,

which does most of the number crunching for you. Here is the scenario…

You have just graduated from University and found employment with the Mambotox consultancy.

Mambotox is funded by outside contracts to evaluate the environmental impact of agricultural

chemicals. Its speciality is testing the effects of pesticides on non-target insects, spiders and

mites that are the natural enemies of crop pests (and hence useful to farmers as biological control

agents). Your first job with this company is to perform an experiment to compare the effects on

hoverflies of three new brands of pesticide designed to target aphids. Aphids are a major pest of

crops, but hoverflies are useful because their larvae are voracious predators of aphids. So an

efficient pesticide that also kills hoverflies may be no better in practice than a less efficient one

that does not.

To do the test you randomly allocate the three pesticides to plots of wheat which have all been

seeded with the same number of hoverfly larvae. After applying the treatments, you sample the

plots for surviving hoverfly larvae. You want to know whether the pesticide treatments influence

the survival of hoverfly larvae. This problem calls for an Analysis of Variance.

The hypothesis

Take a look at your data set at the top of page 13. It shows that each of the three treatments (Zap,

GoFly and Noxious) was applied to five replicate plots; the scores are the number of hoverfly

larvae counted in each replicate after treatment. The null hypothesis, H0, is that the mean scores

do not differ between treatments, i.e. that mean(Zap) = mean(GoFly) = mean(Noxious) in the

sampled population. The alternative hypothesis is that the population means are not all equal.

Analysis of Variance will allow you to test H0 and to decide whether it should be rejected in

favour of the alternative hypothesis.

Start to fill out the cells of the table beneath the data, by summing the scores for each

treatment and dividing each sum by its sample size to obtain the group means. That is what is

meant by the expression:


$$\bar{Y}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} Y_{ij}$$

i.e. Group mean = sum of scores in group / number of scores in group

You can read the formula as follows: the mean (denoted Ȳ_j) for each treatment j is equal to the sum (‘Σ’) of the scores Y_ij for that treatment, for i = 1 to n_j, divided by n_j, which is the sample size (and for each of these treatments it equals 5 plots).

One of the means is rather larger than the others. How do we know if the differences between the

means are due to the pesticide treatments or because of random variation? It might be that

random differences between the 15 plots are enough to explain the higher mean value under one

treatment. This is precisely the null hypothesis that is tested by Analysis of Variance.

Analysing variance from the sums of squares

Analysis of Variance finds out what causes the individual scores to vary from the grand mean of

all the n = 15 plots. If you calculate this grand mean you should get a value of 9260/15 = 617.33.

None of the scores actually equals this grand mean, and their deviations from it are explained by

two possible sources of variation. The first source of variation is the pesticide treatment (Zap,

GoFly or Noxious). If Zap kills fewer hoverfly larvae, then we would expect plots treated with

Zap to have higher scores in general than plots treated with the other pesticides. The second

source of variation is due to differences among plots, which can be seen within each treatment.

The way we measure total variation for an Analysis of Variance is by summing up all the

squared differences from the grand mean. This is called the ‘total sum of squares’ or SS_total:

$$SS_{total} = \sum_{j=1}^{a}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_{total})^2$$

The above expression means: SS_total is obtained by subtracting the grand mean (denoted Ȳ_total) from each score (Y_ij denoting the ith score in the jth treatment) and squaring this difference, then summing these squares for all scores in each treatment and all a treatments. Do this, and keep a note of the value you get.

The reason for squaring each difference is that we can then separate this total variation into its two sources: one due to differences between treatments (called the ‘sum of squares between groups’, or SS_group), and one due to the normal variation between plots (the ‘error sum of squares’, or SS_error). Then it is a very useful property of squared differences that:

$$SS_{total} = SS_{group} + SS_{error}$$

Note that the word ‘error’ here does not mean ‘mistake’, but is a term describing the variation in

scores that we cannot attribute to a specific variable; you may also see it referred to as ‘residual’.

Calculate these sums of squares and put the values in the right-hand column of the table

below. Do this by first calculating the between group sums of squares for each treatment in turn:

$$SS_{group(j)} = \sum_{i=1}^{n_j} (\bar{Y}_j - \bar{Y}_{total})^2 = n_j\,(\bar{Y}_j - \bar{Y}_{total})^2$$

In other words, for each treatment j, square the difference between the group mean and the grand mean and multiply by the sample size. Then add the three results together to get the overall variation between group means, SS_group, and put this value in the right-hand column. Now calculate the error sums of squares for each treatment in turn:


$$SS_{error(j)} = \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2$$

In other words, square the difference between each score and its group mean, and sum these squares. Then add the three group sums to get the overall variation within groups, SS_error, and put this in the right-hand column. Finally, add SS_group to SS_error to get SS_total, and put it in the right-hand column. Does this total equal the value that you obtained from the sum of all squared deviations from the grand mean? It should, showing how total variance can be partitioned into its sources.

The F-value

It is intuitively reasonable to think that if we get a large variation between the group means

compared to variation within the groups, then the means could be considered to differ between

groups because of real differences between the pesticides (rather than because of residual

variation). This is the comparison that the F-value makes for us. It takes the average sum of

squares due to group differences (called the ‘group mean square’ or MS_group) and divides it by the average sum of squares due to subject differences (the ‘error mean square’ or MS_error):

$$F = \frac{MS_{group}}{MS_{error}} = \frac{SS_{group}/(a-1)}{SS_{error}/(n-a)}$$

where a = number of groups, and n = total of 15 plots.

Calculate these mean squares, and add them into the right-hand column. Finally, calculate F.

This ratio will be large if the variation between the groups is large compared to the variation

within the groups. But the value of F will be close to unity for a true null hypothesis, of no

variation due to groups. Just how far above F = 1.00 is too much to be attributable to chance is a

rather complicated function of the number of groups and the number of plots in each group.

Tables of the F statistic will give us this probability based on the degrees of freedom for the

between group variation (a - 1 for a groups or treatments) and the degrees of freedom for the

within group variation (n - a ), or it will be provided automatically by statistical packages.

Use the published table provided for you in Appendix 7 to find the critical value for the upper

5% point of the F-distribution with the appropriate degrees of freedom (denoted v1 and v2 in the

table). The columns of the table give a range of possible degrees of freedom for the group mean

square, which is equal to a -1. The rows of the table give a range of possible degrees of freedom

for the error mean square, which is equal to n - a. Is your calculated value of F greater than this

critical value? If so, you can reject the null hypothesis with < 5% chance of making a mistake in

so doing. In the report of your analysis you would say “pesticide treatments do differ in their

effects on hoverfly numbers: Fv1,v2 = #.##, p < 0.05", substituting in the values of v1 and v2 and the

calculated F to 2 decimal places. Put this conclusion in the final row of your analysis.
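Before moving on to the statistical package, you can reproduce the same steps in R as a check on your hand calculation. This is a minimal sketch using the data given on page 13; the variable names are illustrative only.

score <- c(700, 850, 820, 640, 920,    # Zap
           480, 460, 500, 570, 580,    # GoFly
           500, 550, 480, 600, 610)    # Noxious
trt   <- factor(rep(c("Zap", "GoFly", "Noxious"), each = 5))
a <- nlevels(trt) ; n <- length(score)
grand    <- mean(score)                                              # grand mean, 9260/15
SS.group <- sum(table(trt) * (tapply(score, trt, mean) - grand)^2)   # between groups
SS.error <- sum((score - ave(score, trt))^2)                         # within groups
MS.group <- SS.group / (a - 1)
MS.error <- SS.error / (n - a)
MS.group / MS.error                  # F; compare with the critical value qf(0.95, a - 1, n - a)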

Using a statistical package

Let’s compare the calculations you have been doing laboriously by hand with the output from

a statistical package. Read the same dataset into R, using the format shown on page 14. Now run

an Analysis of Variance in RStudio with the suite of commands on page 14. You should get the

same result as you got from the calculation by hand. Make sure you understand this output in

terms of the calculations you have been doing. When you use statistical packages such as R, you

will need to comprehend what the output is telling you, so that you can be sure it has done what

you wanted. For example, it is always a good idea to check that the output shows the correct


numbers of degrees of freedom. If it is not showing the degrees of freedom that you think it

should, then the package has probably tried to analyse your data in a different way from that

intended, so you would need to go back and check your input commands.

Having done the analysis in RStudio, you can now plot means and their confidence intervals

with two additional lines of R code, which call a script of plotting instructions and then run it:

source(file="http://www.southampton.ac.uk/~cpd/anovas/datasets/PlotMeans.R")

plot_means(aovdata$Trtmnt, aovdata$Score, "Treatment", "Score", "CI")

The 95% confidence intervals around the jth mean are at $\bar{Y}_j \pm 1.96\, s_j/\sqrt{n_j}$, where s_j is the sample standard deviation:

$$s_j = \sqrt{\frac{\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2}{n_j - 1}}$$

The reason for this is that 95% of a normal distribution lies within 1.96 standard deviations of its mean, and the standard deviation of the sample mean (its standard error) is given by the term $s_j/\sqrt{n_j}$. Which of the pesticides can you recommend to farmers? The correct answer is none yet, until you have checked the assumptions of the analysis.

Underlying assumptions of Analysis of Variance

Any conclusions that you draw from this analysis are based on four assumptions. What are

they? Refer back to page 6 if necessary.

1. The first assumption is that the plots are assigned treatments at random, which was indeed a

design consideration when you carried out the experiment.

2. The second assumption is that the residuals should be independently distributed, so they

succeed each other in a random sequence and knowing the value of one does not allow you to

predict the value of another (i.e. they truly represent unexplained variation). This is the

‘assumption of independence,’ which is a matter of declaring all known sources of variation. In

this case, any variation not due to treatment contributes to the MSerror, and we assume it

contains no systematic variation (e.g., due to using different fields for different treatments).

The other assumptions concern the distribution of the error terms (residuals), ε. Use R to

test for these by using the commands on page 14.

3. The residuals should be identically distributed for each treatment, so all the groups have

similar variances. This is because the error mean square used to calculate F is obtained from

the pooled errors around each group mean. Since the analysis is only seeking differences

between means, it assumes all else is equal. This is the ‘assumption of homogeneity of

variances,’ which is visualised with the graph of residuals versus fitted values (funnel shaped

if heterogeneous), and also by the slope of a scale-location graph (non-zero if heterogeneous).

4. Finally, the residuals should be normally distributed about the group means, because the sums of

squares that we use to calculate variance will only provide a true estimate of variance if these

residuals are normally distributed. This is the ‘assumption of normality,’ which is visualised

by the normal Q-Q plot. The plot should follow an approximately straight diagonal; bowing

indicates skew (to the right if convex) and an S-shape indicates a flatter than normal distribution.

There are various statistical methods of putting probability limits on the likelihood of your

residuals meeting each of these assumptions. We will not go into them here, but they are

described in any text book of statistics. Having visually checked the assumptions, which of the

pesticides can you recommend to farmers?


The data:

PESTICIDE

Zap GoFly Noxious

700 480 500

850 460 550

820 500 480

640 570 600

920 580 610

The Analysis of Variance:

                                            Treatment group j
                                      Zap        GoFly      Noxious       Total

Sample sizes, n_j:

Sums of scores, Σ_i Y_ij:

Means, Ȳ_j = Σ_i Y_ij / n_j:                                              Ȳ_total =

SS_group = Σ_j n_j (Ȳ_j - Ȳ_total)²:        +          +          =              d.f. =

SS_error = Σ_j Σ_i (Y_ij - Ȳ_j)²:           +          +          =              d.f. =

SS_total = SS_group + SS_error =

MS_group = SS_group / (a - 1) =

MS_error = SS_error / (n - a) =

F = MS_group / MS_error =

F_crit[0.05] =

Conclusion:


Analysis of Variance in R

For this part, refer to the ‘Using RStudio – Help Guide’ on Blackboard. Type the

data into a new text file called ‘Score-by-pesticide.txt’, separating each score

from its treatment level by a tab. Then read this file into a ‘data frame’ in R and

perform the analysis in RStudio with the following suite of commands:

# 1. Prepare the data frame 'aovdata'

aovdata <- read.table("Score-by-pesticide.txt", header = T)

attach(aovdata) # Access the data frame

Trtmnt <- factor(Trtmnt) # Set Trtmnt as a factor

# 2. Command for factorial analysis

summary(aov(Score ~ Trtmnt)) # Run the ANOVA

bartlett.test(Score ~ Trtmnt) # Test for homogenous variances

shapiro.test(resid(aov(Score ~ Trtmnt))) # Test for normality

# 3. Plot data and residuals

par(cex = 1.3, las = 1) # Enlarge, orient plot labels

plot(Trtmnt, Score, xlab="Pesticide", ylab="Score") # Box plot

par(mfrow = c(2, 2)) ; plot(aov(Score ~ Trtmnt)) # 4 residual plots

par(mfrow = c(1, 1)) ; detach(aovdata) # Reset plot window; detach data frame

The ‘summary’ and ‘plot’ commands will give the following outputs:

Df Sum Sq Mean Sq F value Pr(>F)

Trtmnt 2 215613 107807 16.78 0.000334 ***

Residuals 12 77080 6423

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

From the ANOVA table, you conclude that the

treatment types differ in their effects on survival of

hoverfly larvae (F2,12 = 16.78, P < 0.001). The

ANOVA tells you nothing more than this. You then

interpret where the difference lies from the box plot

(showing median, first and third quartiles, and

max/min values within 1.5 times the inter-quartile range of the box; any outliers beyond that would be

plotted individually). The first two of four residuals

plots are shown below. Residuals versus fitted

(mean) response visualizes any heterogeneity of

variances. Residuals versus theoretical (normal)

quantiles visualises any systematic deviation from

normal expectation given by the diagonal line.

These plots show no detectable increase in heterogeneity with the mean (Bartlett’s K² = 2.63 with 2 d.f., P = 0.27), and no systematic deviation from normality (Shapiro-Wilk W = 0.96, P = 0.75).


LECTURE: TWO-WAY ANALYSIS OF VARIANCE

We have used one-way Analysis of Variance to test whether different treatments of a single

factor have an effect on a response variable (finding a treatment effect: F2,12 = 16.78, P < 0.001).

With two-way Analysis of Variance, we divide the samples in each treatment into sub-samples

each representing a different level of a second factor. A hypothetical example illustrates what the

analysis can reveal about the response variable.

Example of two-way Analysis of Variance: factorial design

In the following experiment, we wish to test the efficacy of different systems of speed reading,

and to know whether males and females respond differently to these systems. We randomly

assign 30 subjects (S1…S30) to three treatment groups: T1, T2 and T3, with 10 subjects per

treatment of which 5 are male and 5 female. The three groups are each tutored in a different

system of speed reading. A reading test is then given and the number of words per minute is

recorded for each subject. The data are presented in a design matrix like this:

Table 1. Design matrix for factorial Analysis of Variance.

                                  SYSTEM
                    T1              T2              T3
SEX     Male     Y1, ... Y5      Y11, ... Y15    Y21, ... Y25
        Female   Y6, ... Y10     Y16, ... Y20    Y26, ... Y30

The table thus has 6 data cells, each containing the responses of 5 independent subjects (here

coded Y1, ... Y5 etc). This is a ‘factorial design’ because these six cells represent all treatment

combinations of the two factors SEX and SYSTEM. Because each cell contains the same number

of responses, we call this a ‘balanced design,’ and because each level of one factor is measured

against each level of the other, it is also an ‘orthogonal’ design. [See page 31 for cross-factored

Analysis of Variance on unbalanced data.].

A two-way Analysis of Variance will give us three very useful pieces of information about the

effects of the two factors:

1. Whether mean reading speeds differ between the three techniques when responses of males

and females are pooled, indicated by a significant F for the SYSTEM main effect;

2. Whether males and females have different reading speeds when responses for the three

systems are pooled, indicated by a significant F for the SEX main effect;

3. Whether males and females respond differently to the techniques, indicated by a significant F

for the SEX:SYSTEM interaction effect.

We get these three values of F from five sources of variation: the n scores themselves, the a cell means Ȳ, the r row means R̄, the c column means C̄, and the single global mean Ḡ.


Table 2. Component means for the factorial design.

                                  SYSTEM                      Row
                    T1            T2            T3            means
SEX     Male        Ȳ             Ȳ             Ȳ             R̄
        Female      Ȳ             Ȳ             Ȳ             R̄
Column means        C̄             C̄             C̄             Ḡ

The R analysis of real data is shown below, producing the interaction plot above. The output

contains the three values of the F-statistic and their significance. The rest of this section is

devoted to explaining just how the means in the table above can lead us to the inferences in the

analysis below – that sex and system both have additive effects on reading speed, with no

interaction between them.

# Prepare data frame ‘aovdata’

aovdata<-read.table("System-by-sex.csv",sep=",",header=T)

attach(aovdata)

# Classify factors and covariates:

sex <- as.factor(sex) ; system <- as.factor(system)

# Specify the model structure:

summary(aov(speed ~ sex*system))

Df Sum Sq Mean Sq F value Pr(>F)

sex 1 25404 25404 5.716 0.025 *

system 2 503215 251608 56.616 8.19e-10 ***

sex:system 2 2817 1408 0.317 0.731

Residuals 24 106659 4444

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

# Interaction plot:

interaction.plot(

sex, system, speed,

xlab = "Sex", ylab = "Speed", trace.label = "System",

las = 1, xtick = TRUE, cex.lab = 1.3

)

# Test for homogeneity of variances

bartlett.test(speed ~ interaction(sex, system))

Bartlett test of homogeneity of variances

data: speed by interaction(sex, system)

Bartlett's K-squared = 9.8486, df = 5, p-value = 0.07964

# Test for normality of residuals

shapiro.test(resid(aov(speed ~ sex*system)))

Shapiro-Wilk normality test

data: resid(aov(speed ~ sex * system))

W = 0.97261, p-value = 0.6127

detach(aovdata)


Using a statistical model to define the test hypothesis

In defining the remit of our analysis, we want to make a statement about the hypothesised

relationship of the effects to the response variable, and this can be done most concisely by

specifying a model. In the one-way Analysis of Variance that you conducted in the practical, you

tested the model:

HOVERFLIES = PESTICIDE + ε

The ‘=‘ does not signify a literal equality, but a statistical dependency. So the statistical analysis

tested the hypothesis that variation in the response variable on the left of the equals sign

(numbers of hoverflies) is explained or predicted by the factor on the right (pesticide treatments),

in addition to a component of random variation (the error term ε, ‘epsilon’). This error term

describes the residual variation between the plots within each treatment. We could have written it

out in full as ‘PLOTS(PESTICIDE)’ meaning the variation between the random plots nested

within the different types of pesticide (‘nested’ because each treatment has its own set of plots).

The Analysis of Variance tested whether much more of the variation in hoverfly numbers falls

between the categories of ‘Zap’, ‘GoFly’ and ‘Noxious’, and so is explained by the independent

variable PESTICIDE, than lies within each category as unexplained residual variation, ε = PLOTS(PESTICIDE). This was accomplished by calculating the ratio:

Pesticide effect:  $$F = \frac{MS_{group}}{MS_{error}} = \frac{MS_{PESTICIDE}}{MS'_{PLOTS(PESTICIDE)}}$$

For our two-way experimental design, we can also partition the sources of variance. This time the

sources partition into two main effects plus an interaction, and the residual variation within each

sex and system combination. The full model statement looks like this:

SPEED = SEX + SYSTEM + SEX:SYSTEM + SUBJECTS(SEX:SYSTEM)

The four terms on the right of the equals sign describe all the sources of variance in the response

term on the left. The last term describes the error variation, ε. It is often not included in a model

description because it represents residual variation unexplained by the main effects and their

interaction. But it is always present in the model structure, as the source of random variation

against which to calibrate the variation explained by the main effects and interaction. With this

model, we can calculate three different F-ratios:

Sex effect:  $$F_1 = \frac{MS_{group}}{MS_{error}} = \frac{MS_{SEX}}{MS'_{SUBJECTS(SEX:SYSTEM)}}$$

System effect:  $$F_2 = \frac{MS_{group}}{MS_{error}} = \frac{MS_{SYSTEM}}{MS'_{SUBJECTS(SEX:SYSTEM)}}$$

Sex:System interaction effect:  $$F_3 = \frac{MS_{interaction}}{MS_{error}} = \frac{MS_{SEX:SYSTEM}}{MS'_{SUBJECTS(SEX:SYSTEM)}}$$

Degrees of freedom

Before attempting the analysis, we should check how many degrees of freedom there are for each

of the main effects and the interaction, and how many error degrees of freedom. Remember that

degrees of freedom are given by the number of pieces of information that we have on a response,

minus the number needed to calculate its variation.

The SEX main effect is tested with 1 degree of freedom (one less than its two levels: male and

female), and the SYSTEM main effect with 2 degrees of freedom (one less than its three levels);


the SEX:SYSTEM interaction effect is tested with the product of these two sets of degrees of

freedom (i.e. 1 × 2 = 2 degrees of freedom). The error degrees of freedom for both effects and the

interaction comprise one less than the remaining numbers in the total sample of N = 30, which is

30-(1+2+2)-1 = 24. You can also think of error degrees of freedom as being N – a, which is the

number of observations minus the a = 6 sample means needed to calculate their variation.

Thus the significance of the SEX effect is tested with a critical F1,24, SYSTEM with F2,24 and the

SEX:SYSTEM interaction with F2,24.

General rule: In general for an Analysis of Variance on n subjects (Y) measured against two

independent factors X1 (the row factor in a design matrix such as Table 1) and X2 (the column

factor), with r and c levels (samples) respectively, the model has the following degrees of

freedom:

model:   Y  =  X1     +  X2     +  X1:X2            +  Y(X1:X2)
d.f.:          r - 1     c - 1     (r - 1)(c - 1)       N - rc

The reason why the error degrees of freedom are rc less than N is simply because rc is equal to

one more than the sum of all the main effect and interaction degrees of freedom. Thus the four

sets of degrees of freedom all add up to a total of N - 1 degrees of freedom.
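To see why these add up, note that the main-effect and interaction degrees of freedom together come to one less than rc:

$$(r-1) + (c-1) + (r-1)(c-1) = rc - 1$$

so the four components sum to (rc - 1) + (N - rc) = N - 1. In the reading-speed example, 1 + 2 + 2 + 24 = 29 = N - 1 with N = 30.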

In practice, when you design an experiment or fieldwork protocol that will require Analysis of

Variance, you can use this knowledge to work out in advance how many subjects you need. You

will need rc degrees of freedom (e.g. 2 levels of sex times 3 of system = 6) just to define the

group dimensions, and then at least the same again to give you enough error degrees of freedom

for a reasonably powerful test.

How to do a two-way Analysis of Variance

A two-way analysis comprises a test of the model as a whole, and a test of the individual terms in

the model. Its degrees of freedom and sums of squares follow the same principles as the one-way

Analysis of Variance. The ‘Quantities’ column shows how the component sums of squares relate

to each other (with n defining the number of replicates in each of the rc samples):

Table 3a. Calculation of degrees of freedom and sums of squares for the two-factor model.

Source of variation            d.f.               SS                                           Quantities
1 Between cells (model)        rc - 1             n × Σ(Ȳcell - Ḡ)² over the rc cells
2 Within cells (error)         rc(n - 1)          Σ(Y - Ȳcell)² over all rcn scores
3 Total                        rcn - 1            Σ(Y - Ḡ)² over all rcn scores                SS3 = SS1 + SS2

Table 3b. Calculation of degrees of freedom and sums of squares for the terms in the model.

Source of variation            d.f.               SS                                           Quantities
4 Between rows (Sex)           r - 1              cn × Σ(R̄ - Ḡ)² over the r rows
5 Between columns (System)     c - 1              rn × Σ(C̄ - Ḡ)² over the c columns
6 Interaction (Sex:System)     (r - 1)(c - 1)     SS1 - SS4 - SS5
7 Within cells (error)         rc(n - 1)          as SS2
8 Total                        rcn - 1            as SS3                                       SS8 = SS4 + SS5 + SS6 + SS7


These sums of squares allow us to calculate mean squares, MS, for components 1 to 2 and 4 to 7,

by dividing each SS by its degrees of freedom. Finally, we get one F-statistic for each of

components 4, 5 and 6, by dividing the row MS by the error MS (from component 7). These are the

mean squares and F-statistics shown in the R output pictured earlier.

You do not need to learn the formulae in the table above, but you should be able to gain from

them an appreciation of how the total sums of squares are partitioned into the different sources.
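As a check on this bookkeeping, the mean squares and F-ratios in the R output shown earlier can be recovered directly from its sums of squares and degrees of freedom (a small sketch; the numbers are copied from that output):

SS <- c(sex = 25404, system = 503215, interaction = 2817, error = 106659)
df <- c(sex = 1,     system = 2,      interaction = 2,    error = 24)
MS <- SS / df                                          # mean square = SS / d.f.
MS[c("sex", "system", "interaction")] / MS["error"]    # F-ratios: 5.72, 56.62, 0.32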

Interpreting the results

When we did one-way Analysis of Variance we obtained a single F-statistic on which to base our

conclusions about the hypothesised relationship. The two-way analysis, however, gives three

different values of F, each telling us about different aspects of the hypothesised relationship.

A significant SEX:SYSTEM interaction would allow us to conclude that the techniques have

different effects on males and females. In the particular example we have in Fig. 1, the

interaction term is not significant (F2,24 = 0.32, p > 0.7), meaning that the effect of reading

technique on speed is not modulated by (does not depend on) sex. In other words, reading

technique influences speed in the same way for males and females. That would be the conclusion

from the R analysis shown above.

A significant SEX effect (F1,24 = 5.72, p = 0.025 in Fig. 1) means that males and females have

different mean speeds, irrespective of technique.

A significant SYSTEM effect (F2,24 = 56.62, p < 0.001) means that reading technique does

influence mean speeds, irrespective of sex.

How do we interpret the analysis if one or other of the main effects is not significant? If the

interaction effect is significant, but the SYSTEM effect is not, what does this tell us about the

different reading techniques? In general, if an interaction term is significant, then both of the

component effects must also be significant, because each one influences the effect of the other on

the response variable. We should therefore always report a significant interaction first, before

considering the main effects. Some graphical illustrations will help to explain why this is.

Using interaction plots to help interpret two-way Analysis of Variance

Take a look at the set of eight graphs on the next page. These are called ‘interaction plots’ and

they illustrate all eight possible ways in which a response variable can depend on two factors.

The idea is to plot the response variable against one of the independent effects (it does not matter

which one) and then plot on the graph the sample means for each level of the other independent

effect. For the sake of clarity, means are plotted without error bars, and we can assume that each

would have only a small residual variation above and below it.

For each type of SYSTEM (T1, T2 and T3), the mean response is plotted for each type of SEX

(male or female), and joined by a line. Thus the mid-point of each of these lines reveals the mean

reading speed for systems T1, T2 and T3, irrespective of any sex effects. You can guess roughly

where the mean reading speed is for each sex from the average height of the three points at each

sex.
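
In R, plots of this kind can be drawn with the 'interaction.plot' command; a minimal sketch, again assuming the variable names Speed, Sex and System from the worked example:

# Mean Speed plotted against Sex, with one line per System
interaction.plot(x.factor = reading$Sex, trace.factor = reading$System,
                 response = reading$Speed, xlab = "Sex", ylab = "Mean speed")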


Fig. 2. Interaction plots for two independent effects, illustrating the eight possible outcomes of a two-way Analysis of Variance.

In each panel, mean SPEED is plotted against SEX (M and F on the x-axis), with a separate line joining the means for each reading system (T1, T2 and T3). The eight panels show the following combinations of outcomes:

1. Significant SEX effect; no significant SYSTEM effect; no significant interaction.
2. No significant SEX effect; significant SYSTEM effect; no significant interaction.
3. Significant SEX effect; significant SYSTEM effect; no significant interaction.
4. Significant SEX effect; significant SYSTEM effect; significant interaction.
5. No significant SEX effect; significant SYSTEM effect; significant interaction.
6. Significant SEX effect; no significant SYSTEM effect; significant interaction.
7. No significant SEX effect; no significant SYSTEM effect; significant interaction.
8. No significant SEX effect; no significant SYSTEM effect; no significant interaction.

Graph 1 in Fig. 2 shows three systems that do not differ in their effects on reading speeds, but

females out-perform males on average.

Graph 2 shows males and females doing equally well, but subjects learning system T1

outperforming those learning system T2 who do better than those learning system T3.

Graph 3 shows the same differences between systems, but females also doing better on

average than males under any of the systems. This is the result we actually obtained.

Graph 4 shows what a significant interaction effect looks like. The effects of system depend

on sex, with differences between the methods having a more pronounced effect on female

reading speeds than those of males. In other words, the system effect is modulated by sex (or

equally, the sex effect is modulated by system).

Graph 5 shows males and females with the same average reading speeds (as in graph 2), but

the system effect depends very much on sex, with T3 being best for males and T1 for females.

In graph 6, females do better than males on average. The mid-points of the lines all coincide at

the same score for the response variable, and so no differences are apparent between the

systems if we pool males and females. But the type of reading system clearly does have an

important influence on males, and an equally important - but different - influence on females.

Thus the significant interaction indicates a real effect of system, even though it was not

significant as a main effect.

In graph 7, neither sex nor system is significant as a main effect, but their combined effect is.

The effects of technique are apparent only when the sexes are considered separately.

In graph 8, speed is not influenced by sex or system, either independently or interactively.

Only under this outcome would we fail to reject the null hypothesis that neither factor has an

influence on reading speed.

Other types of two-way Analysis of Variance

So far we have only considered factorial designs, which have replicates in all combinations of

levels of both factors. If a two-factored design has no replication within each cell, then it will

not be possible to look for interaction effects, and they must be assumed to be negligible. The

‘Latin square’ is an example of this (read more about it in Sokal & Rohlf). It is used in

situations where a single main effect is being tested (say 4 types of fertiliser on crop yield),

but in the presence of a second ‘nuisance’ effect (e.g. a gradient of moisture on the slope of a

hill). The best way to deal with this situation is to lay out the plots in a structured pattern

(rather than random allocation):

Hill top A B C D

B C D A

C D A B

Hill bottom D A B C

Thus each of the 4 levels of height has each of the 4 types of fertiliser (A-D), so the design is fully

orthogonal. The test model is: ‘Response = Factor + Block’, meaning that the response (yield)

is tested against a main factor (fertiliser) and a blocking variable (moisture), with

the error mean square provided by the unexplained interaction Factor:Block.
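
A minimal sketch of such an analysis in R, assuming a data frame 'plots' with columns Yield, Fertiliser and Block (position on the moisture gradient), both coded as factors:

# Yield tested against the factor of interest plus the blocking variable (sketch)
summary(aov(Yield ~ Fertiliser + Block, data = plots))
# with no replication, the Fertiliser:Block interaction is left out and supplies the error term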

Many other designs are possible. You might read about nested analyses, or three-way or

higher order factorials, but when designing your own data collection, try to avoid the need for

these, because greater sophistication always requires more stringent conditions.


LECTURE: REGRESSION

We have seen how Analysis of Variance gives us the capacity to test for differences between

category means. For example, are males heavier on average than females in the sampled

population? Here the response variable is weight and the categories are the two sexes. Sometimes

however we want to measure the response variable against a continuous, instead of a categorical,

variable. If we want to know whether Weight varies with Age, we could divide the observations

into age categories (e.g. ‘juvenile’ and ‘adult’) and do an ANOVA, or we could measure Weight

on a continuous scale with Age. In the latter case we are asking whether Weight regresses with

Age. Specifically, we hypothesise that Weight shows a linear relationship to Age (we will treat

non-linear relationships later). The statistical model is the same in both cases, and it is tested

with Analysis of Variance in both cases. Only the degrees of freedom are different:

Model for Analysis of Variance by categories:   Weight = Age + ε
    d.f. for n data points and a categories:    Age: a - 1;   ε: n - a

Model for Analysis of Variance by regression:   Weight = Age + ε
    d.f. for n data points:                     Age: 1;   ε: n - 2

Both models could be analysed with the ‘aov’ command in R, though the first one would require

identifying Age as a ‘factor’ (with the command: Age <- as.factor(Age)). Whether you do

the regression analysis with the ‘aov’ command or the ‘lm’ command in R, the same Analysis of

Variance will be done for you, giving an F-statistic with 1 and n-2 degrees of freedom.
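
As a minimal sketch (assuming the attached data frame of the worked example below, with variables Weight and Age), the two approaches are:

# Weight analysed against Age as a categorical factor
summary(aov(Weight ~ as.factor(Age)))
# Weight analysed against Age as a continuous predictor (linear regression)
summary(lm(Weight ~ Age))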

Where do these regression degrees of freedom come from? The value of F is calculated from

MS[Age] divided by MS[ε]. For MS[Age] we have 1 d.f. because we have two pieces of

information with which to construct our regression line - the intercept and slope - and we need

one piece of information - the overall mean weight - in order to calculate whether the regression

varies from horizontal. For MS[ε] we have n-2 degrees of freedom because we have n pieces of

information - the data points - and we need two pieces - the intercept and slope - in order to

calculate the residual variation, given by the squared deviation of each observation from the line.

Let’s see how this works with an actual example. The following page shows a data set on new-

born badger cubs. Body weights in grams at different ages in days have been typed into a text file

and the response Weight regressed against the predictor Age. The ‘lm’ command in R has done

an Analysis of Variance on the 12 data points, giving 1 and 10 d.f. This Analysis of Variance

tests the compatibility of the data with a regression slope of zero (i.e., a horizontal regression) in

the population of interest. The result of F1,10 = 3.90, P = 0.076 tells us that we have too high a

probability of a false positive (P > 0.05) to reject the null hypothesis of zero slope, and therefore

that weight does not co-vary detectably with age. The plot shows data points with homogeneous

variance across the range of Age, no obvious deviations from normally distributed residuals

around the regression line, and a linear relationship. The 95% confidence intervals in the plot

show that the regression slope could swivel to horizontal without passing outside them –

confirming our lack of confidence in the sampled population having a relationship of Weight to

Age.

How does the analysis arrive at this result? Look now at page 25, which shows an Excel file into

which the data have been typed. Here we see how the F-value was calculated.

As with the Analysis of Variance for a class predictor variable, the Analysis of Variance for a

continuous predictor variable partitions the squared deviations of the response variable into two

independent parts. These are the explained (or ‘regression’), and the unexplained (or ‘residual

error’), sums of squares, which together add up to the total squared deviations of the response

variable from its mean value. The Table on page 26 summarises the operations.


# Linear regression in R on response of Weight to Age

# 1. Prepare the data frame ‘aovdata’

aovdata <- read.table("Weight-by-age.txt", header = T)

attach(aovdata) # Access the data frame

Age <- as.numeric(Age) # Set Age as ‘numeric’

# 2. Commands for regression analysis

model.1.1i <- lm(Weight ~ Age) # Analyse and store

summary(model.1.1i) # Print the results

Call:

lm(formula = Weight ~ Age)

Residuals:

Min 1Q Median 3Q Max

-144.058 -89.751 7.117 68.571 174.375

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 420.58 67.85 6.198 0.000102 ***

Age -18.22 9.22 -1.976 0.076392 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

Residual standard error: 110.2 on 10 degrees of freedom

Multiple R-squared: 0.2808, Adjusted R-squared: 0.2089

F-statistic: 3.904 on 1 and 10 DF, p-value: 0.07639

# 3. Plot the data

plot(Age, Weight,cex=1.5, las=1,

xlab="Age (days)", ylab="Weight (g)")

# 4. Add regression line and 95% confidence intervals

abline(coef(model.1.1i)) # add regression line

confint <- predict(model.1.1i, interval="confidence")

lines(Age, confint[,2], lty=2) # add lower c.i

lines(Age, confint[,3], lty=2) # add upper c.i

coef(model.1.1i) # print intercept and slope

(Intercept) Age

420.57576 -18.21678

# 5. Test assumptions

shapiro.test(resid(lm(Weight ~ Age))) # Normality of residuals

library(car);ncvTest(lm(Weight ~ Age))# Homogeneity of variance


This is how the terms are calculated in the Excel sheet on the preceding page:

Order  Term              Derivation                        Meaning of symbols

1.     SSx               Σ(x - x̄)²                         The sum of squared deviations of x from its mean, where x is Age (column B) and x̄ is mean age (cell B18).

2.     SS(Total)
       [or ‘SSy’]        Σ(y - ȳ)²                         The sum of squared deviations of y from its mean, where y is Weight (column F) and ȳ is mean weight (cell F18).

3.     SPxy              Σ(x - x̄)(y - ȳ)                   The ‘sum of products’ of the deviations of x with y. Dividing this by (n - 1) gives the ‘covariance’.

4.     Slope: b          SPxy / SSx                        Gradient of the regression line. A horizontal line has b = 0. A positive gradient has b > 0, while negative has b < 0.

5.     Intercept: a      ȳ - b·x̄                           Calculated knowing the regression line passes through (x̄, ȳ).

6.     SS(Explained)     Σ(ŷ - ȳ)²                         Explained sum of squared deviations, where ŷ = a + bx. This is the magnitude of the predicted deviation from ȳ.

7.     d.f.(Explained)   2 - 1 = 1                         We have two pieces of information (a and b) and we need one piece (ȳ) to calculate the explained variation.

8.     MS(Explained)     SS(Explained) / d.f.(Explained)   Mean square explained variation: the variance measured as variability per degree of freedom.

9.     SS(Error)         Σ(y - ŷ)²                         Unexplained sum of squared deviations, where ŷ = a + bx. This is the magnitude of deviation from the predicted ŷ.

10.    d.f.(Error)       n - 2                             We have n pieces of information (the values of y) and we need two pieces (a and b) to calculate the error variation.

11.    MS(Error)         SS(Error) / d.f.(Error)           Mean square unexplained (residual error) variation: the variance measured as variability per degree of freedom.

12.    F                 MS(Explained) / MS(Error)         The ratio of explained to unexplained variances, to be compared against tables of the F-distribution with 1 and n-2 degrees of freedom.

13.    R²                SS(Explained) / SS(Total)         ‘Coefficient of determination’ (often written r²): the proportion of explained variation. If R² = 1, all y lie on a regression line for which b ≠ 0; if R² = 0 then b = 0.

14.    R                 SPxy / √(SSx·SSy)                 ‘Pearson product-moment correlation coefficient’, r. Equal in magnitude to the square root of the coefficient of determination. Negative R means y tends to decrease with increasing x.
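
The same quantities can be calculated directly in R. The following sketch assumes the vectors Age and Weight from the attached data frame on page 24, and reproduces steps 1 to 14:

SSx  <- sum((Age - mean(Age))^2)                           # 1. SSx
SSy  <- sum((Weight - mean(Weight))^2)                     # 2. SS(Total)
SPxy <- sum((Age - mean(Age)) * (Weight - mean(Weight)))   # 3. sum of products
b <- SPxy / SSx                                            # 4. slope
a <- mean(Weight) - b * mean(Age)                          # 5. intercept
predicted <- a + b * Age                                   #    predicted weights, y-hat
SS.explained <- sum((predicted - mean(Weight))^2)          # 6. SS(Explained)
MS.explained <- SS.explained / 1                           # 7.-8. with 1 d.f.
SS.error <- sum((Weight - predicted)^2)                    # 9. SS(Error)
MS.error <- SS.error / (length(Weight) - 2)                # 10.-11. with n-2 d.f.
F.value <- MS.explained / MS.error                         # 12. F on 1 and n-2 d.f.
R2 <- SS.explained / SSy                                   # 13. coefficient of determination
r  <- SPxy / sqrt(SSx * SSy)                               # 14. Pearson correlation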

Other terms in the R output:

‘t’-values are Student’s-t tests for departures of the intercept from zero, and the slope of Weight

with Age from zero. Note that the value of the Student’s-t test of the slope is equal to the square-

root of the value of F from the Analysis of Variance, and both significances are identical. This is

because both these tests are accomplishing exactly the same task.

‘Residual standard error’ is the square-root of the variance term given by MS(error).

‘Multiple R-squared’ is the coefficient of determination.

‘Adjusted R-squared’ is the coefficient of determination adjusted to take account of the

number of d.f. used by the model.


The regression analysis on pages 24 and 25 works by partitioning the total variation in the

response variable into explained and unexplained parts. The total variation is obtained from

summing all the squared deviations of each weight value from the mean weight. The long arrow

on the graph on page 25 illustrates the portion of total variation contributed by just one

observation. The analysis will partition the total variation into its two components, illustrated by

the shorter arrows on the graph. One component is predicted by the regression line:

SS(Explained), while the other is the unexplained variation around the line: SS(Error). The

analysis will then calculate the average squared deviations of these two components, in order

finally to get from their ratio: MS(Explained) / MS(Error), the F-value with which to test the

significance of the regression.

The analysis proceeds in steps. First we find the regression line that will estimate values of y for

each of our values of x. With these predicted values, ŷ, we will then be able to sum their squared

deviations from ȳ in order to get the explained sum of squares: SS(Explained).

Steps 1 to 5 of the table on page 26. To find the regression line we must find values for two new

parameters: the slope of the line, b, and its intercept with the y-axis, a.

The slope b is calculated from the sum of products: SPxy = Σ(x - x̄)(y - ȳ), divided by the sum

of squared deviations in x: SSx = Σ(x - x̄)². The sum of products on the numerator tells us about

the covariance of y with x. It gives the slope a positive value if the coordinates for each data

point, x and y, tend to be either both larger than their respective means x̄ and ȳ, or both smaller. The

slope will have a negative value if in general x < x̄ when y > ȳ, and vice versa. This formula for

b also means that the gradient of the slope will have a magnitude of one if, on average, each

deviation |y - ȳ| has the same magnitude as each corresponding deviation |x - x̄|. If the deviations

in y are relatively greater than those in x, then the slope will be steeper than 1. The Excel sheet on

page 25 shows that the regression line on the graph has a gradient of –18.217, signifying that y is

predicted to decrease as x increases and that each decrease in y is predicted to be some 18 times

the corresponding increase in x.

The intercept a is calculated from a = ȳ - b·x̄. This is simply a rearrangement of the equation

for a straight line: y = a + bx. In this case we have known values for the two variables y and x, in

their respective sample means ȳ and x̄, and since we have just calculated b, we can now find the

unknown a.

With values for a and b, we have all the information we need to draw the regression line on the

graph. Excel can do this for us if we request ‘Add Trendline...’ from the ‘Chart’ menu. The result

is shown on the graph on page 25, and it accords with the equation: the line appears to intercept

the y-axis somewhere around 400 g, and the calculated a tells us it is exactly at y = 420.576 g.

Steps 6 to 8. With the two parameters b and a we can predict Weight, ŷ, for any given value of

Age, x. For each observed x we now calculate (ŷ - ȳ)² (column L of the Excel sheet) and sum

them to get the explained sum of squares: SS(Explained).

Steps 9 to 12. Finally, we need the unexplained sums of squares, which we get from the squared

deviation of each y from its predicted ŷ. The sum of all these (y - ŷ)² (in column N of the Excel

sheet) is then the SS(Error). Now we calculate the mean squares: MS(Explained) and MS(Error),

and the F-statistic, in just the same way as for any other Analysis of Variance.

Steps 13 to 14. There remains one final parameter to calculate: the proportion of explained

variance, which is simply SS(Explained) / SS(Total). We call this fraction the ‘coefficient of

determination’ r2. Its square root is called the ‘Pearson product-moment correlation coefficient’

r. Step 14 of the Table on page 26 shows how r is calculated directly, which results in it having a

positive or negative value according to whether the regression is positive or negative.


PRACTICAL: TWO-WAY ANALYSIS OF VARIANCE IN R

Do this analysis in RStudio (refer to the ‘Using RStudio – Help Guide’ on Blackboard). Prepare a

short report to the pharmaceutical company that makes the drug Ritalin, evaluating the utility of

their product (1 side A4). Divide your report into sections: an Introduction to explain the interest

in doing the test; Experimental Design and Analysis outlined briefly; Results, including the

ANOVA table showing Sums of Squares and Mean Squares etc, with an interpretation of the

analysis in the form: “the effect of the drug depended / did not depend on the condition of the

subject (F = #.##; d.f. = ##, ##; P = #.##) ... the main effect of treatment… etc.” Interpret the main

effects after the interaction. Include a fully annotated ‘interactions plot’. Finish with a short

paragraph of Conclusions about appropriate use of the drug.

Two-way Analysis of Variance

In the previous class practical you conducted a one-way Analysis of Variance. ‘One-way’ meant

that you were looking for differences between mean treatment effects for a single independent

factor (pesticide). Sometimes we are interested in responses to more than one independent factor,

and then it is possible to conduct an Analysis of Variance with two or more main effects. The

example below takes you through a two-way Analysis of Variance that you can perform for

yourself in R. It illustrates how analysis of two independent variables can yield informative

inferences. You may find that the output you get is easier to interpret after reading the

accompanying lecture notes on two-way Analysis of Variance.

Rationale

The drug Ritalin was designed to calm hyperactive children, but hyperactivity is a difficult

condition to diagnose, so it is important to know what effect Ritalin has on non-hyperactive

children. The following medical trial tested two groups of children, one non-hyperactive and the

other hyperactive. Each group was randomly divided with one half receiving Ritalin in tablet form,

and the other half a placebo (a salt tablet with no physiological effect). The following activity

responses were recorded on the four samples each of 4 children:

                                     TREATMENT
                              Placebo             Ritalin
CONDITION   Non-hyperactive   50, 45, 55, 52      67, 60, 58, 65
            Hyperactive       70, 72, 68, 75      51, 57, 48, 55

In this experimental design, the two independent variables are CONDITION (non-hyperactive or

hyperactive) and TREATMENT (placebo or Ritalin). Each CONDITION is tested with each level

of TREATMENT on replicate subjects. A design of this sort is called a ‘factorial design’ and it

allows us to test for a possible interaction between the two factors in their effects on the response

variable. Here the interaction we are seeking is whether the effect of Ritalin on activity depends on

the condition of the child. This could be a good thing, if for example the drug only influences

hyperactive children, or it could provide cautionary information, if the drug is found to have a

more pronounced effect on non-hyperactive than hyperactive children.

Analysis with R

Enter these data into a data frame from a .csv file (command line shown in the two-way

ANOVA lecture) or a .txt file (command line shown in the regression lecture). The data frame

should have 16 rows, one for each score labelled with its combination of treatment-by-condition:


Treatment Condition Activity

Placebo Nonhyp 50

Placebo Nonhyp 45

Placebo Nonhyp 55

Placebo Nonhyp 52

Ritalin Nonhyp 67

Ritalin Nonhyp 60

: : :

Then use the same R commands as for the speed-reading analysis on page 16 to run the analysis

and produce an interaction plot. This requires that you specify the response variable and

explanatory factors in an ANOVA model of the form: ‘response ~ factor_1*factor_2’, meaning:

‘variation in the response is explained by the additive effects of factors 1 and 2 and by their

interaction’. You could equally spell out the model without using the ‘*’ shorthand: ‘response ~

factor_1 + factor_2 + factor_1:factor_2’. Both expressions give identical results. In this case, the

model you are going to test with Analysis of Variance is that activity is influenced by treatment

and by the child’s condition, and by the interaction of treatment with condition. The model tests

these explained sources of variation in the response against unmeasured ‘residual’ variation.

Save the interaction plot and copy it into your report on the analysis.
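
If you are unsure of the commands, the general pattern is sketched below (the data frame name 'ritalin' and the file name are assumptions; check the notes cited above for the exact commands used in class):

ritalin <- read.table("ritalin.txt", header = TRUE)            # assumed file name
model.rit <- aov(Activity ~ Treatment * Condition, data = ritalin)
summary(model.rit)                                             # F and P for each term
interaction.plot(ritalin$Treatment, ritalin$Condition, ritalin$Activity,
                 xlab = "Treatment", ylab = "Mean activity")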

Now check the residuals, by nesting the ‘aov(…)’ command within a ‘plot(…)’ command

(see example on page 14). The first two graphs suffice to show homogeneous variances – which is

the most important consideration, though with a rather flat distribution of residuals.

As with everything in R, if you are not sure how to do something, try it and see – you can’t break

the package! Save your commands in a ‘script’ file, so that you can use them again in the future,

and refer to them to see how you did things in the past. Do search the web for help, as usually

someone will have posted an answer to someone else’s similar problem. For example, if you want

to know more about interpreting the Normal Q-Q plot of residuals, try Googling ‘normal Q-Q

plots in R showing skew’.

Peruse the results of the ANOVA, noting that a separate F-value and associated p-value have

been produced for each of the main effects Treatment and Condition, and for the Treatment-by-

Condition interaction. Which effects are significant? How do we interpret these results? Refer to

the lecture notes on two-way ANOVA to be sure which d.f. apply to each F-value.

Interpretation

The analysis reveals something very interesting from a medical point of view, though it needs the

interaction plot to understand it. This plot illustrates qualitatively what the ANOVA described

statistically, and it unmasks the full effect of the drug… Hyperactive children are less active on

average with the drug than with the placebo. That is to be expected, but Non-hyperactive children

are more active on average with the drug than with the placebo. This is the significant interaction

effect that you will have obtained in the ANOVA. For each Treatment level, the point midway

between the two condition-level means indicates that Treatment-level mean after pooling levels of

Condition. These midway points are at an Activity score of about 58 for both Placebo and Ritalin,

which explains the non-significant main effect of Treatment. Does a non-significant main-effect of

Treatment indicate that the drug is ineffectual? No! The significant interaction means that the full

effects of the drug become apparent only when the condition of the children is taken into account.

Ritalin does affect activity, but although it subdues hyperactive children it raises the activity of non-

hyperactive children. This is one reason why it is a controversial drug that must be prescribed only

to hyperactive children. The take-home message for interpreting two-way ANOVA is to read the

ANOVA table from the bottom up, because the main effects only make sense in the light of the

interaction.


LECTURE: CORRELATION AND TRANSFORMATIONS

Review of ANOVA procedures in regression

We have seen how the significance of a simple regression line is calculated by one-way Analysis

of Variance. Our example used the statistical model: Weight = Age + ε. We evaluated how good

a predictor Age is in this model by partitioning the total observed variation in weight (measured

as the sum of squared deviations from the sample mean: Σ(y - ȳ)²) into a portion explained by

the line of best fit for Weight against Age (SS[Age] = Σ(ŷ - ȳ)²), and an unexplained portion

(SS[ε] = Σ(y - ŷ)²). We could then work out our F-statistic from the ratio of average explained

variation to average unexplained variation: F1,n-2 = MS[Age] / MS[ε].

Just as you can expand ANOVA from a one-way to a two-way analysis by introducing a second

factor (as we did in Lecture 2 and Practical 2 in this series), so you can expand regression from

simple- to multiple-regression, by introducing a second factor.

This second factor may be categorical, in which case you can plot the response variable against

the continuous factor, and calculate one regression line for each level of the categorical factor. If

the regression lines are not horizontal then you may have a significant continuous factor, and if

the lines do not coincide then you may have a significant categorical factor. If the regression lines

have different slopes, then you may have a significant interaction effect. The interaction plots

shown on p. 20 of this booklet illustrate some of the range of outcomes you could get - just think

of the x-axis as representing some continuous variable instead of the categorical factor ‘Sex’ (for

example ‘Age’), and the lines joining sample means then become regression lines for each level

of the categorical factor (in this case, ‘System’).

If the second factor is continuous rather than categorical, then you will need to illustrate these

data in a 3-dimensional graph, with the response on the vertical axis, and the two continuous

factors on orthogonal (i.e. ‘at right-angles’) horizontal axes. The best-fit model will then be a

plane through the data, as opposed to lines through the data.

With these more complicated models, the Analysis of Variance should be done with a balanced

design, so the same number of observations are recorded at each combination of factor levels.

The design can become unbalanced by missing data, or by using explanatory factors that are

correlated with each other and therefore non-orthogonal. For example if variation in body height

is modelled against right-leg length and against left-leg length, the second-entered explanatory

variable will appear to have no power to explain height while the first-entered explanatory

variable may appear highly significant. The problem is that the two variables are correlated with

each other, so the design is unbalanced by having missing data on short-left and long-right legs

and on short-right and long-left legs. In effect, the variables are not orthogonal to each other.

Having accounted for the variation explained by the first-entered factor there is then necessarily

little variation left over for explanation by the second-entered factor. The true relationship would

be better analysed with a one-factor regression on a single composite explanatory variable of ‘leg

length’ that uses the average of left and right lengths. For more on this topic see Doncaster &

Davey (2007 Analysis of Variance and Covariance, pages 237-242).


Correlation

For some types of investigation of covariance between continuous variables we may wish to seek

correlation without making predictions about how one variable is influenced by the other. For

example, if we have measures of body Volume for each Weight, we may not have an a priori

reason for knowing whether Volume determines Weight, or Weight determines Volume.

For the analysis of Weight and Age, in contrast, Age was clearly an explanatory (predictor, x)

variable and Weight the response (y) variable. The analysis of those two factors was predictive

because Age was hypothesised to influence Weight, but Weight could not under any

circumstances influence Age. Wherever we have employed Analysis of Variance up to now, it

has been used to explain variation in a response variable in terms of a predicted effect.

For the analysis of Weight and Volume we may not have a priori reasons for classifying one

variable as ‘effect’ and the other as ‘response’. We then restrict ourselves to seeking an inter-

dependency, or an association, between the two continuous variables. We can test for association

with the correlation coefficient r, because its value does not depend on which variable is on

which axis. The strength of correlation can still be tested with the Student’s-t or the Analysis of

Variance, as on page 24, because both these tests remain unchanged regardless of which variable

is x and which y.
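
In R, a minimal sketch of such a test (assuming vectors Weight and Volume, as in the example above):

cor(Weight, Volume)                              # Pearson correlation coefficient r
cor.test(Weight, Volume)                         # Student's-t test of r
cor.test(Weight, Volume, method = "spearman")    # non-parametric alternative on ranks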

The equation of the regression line does change, however, if we swap the axes. We can see what

happens to it by manipulating the regression we did of Weight with Age (pp. 24-25 and practical

3 - you can try this with the Excel sheet that you create for the practical). The equation for the

regression on page 25 was:

Weight = 420.6 – 18.2 Age

If the axes are swapped, a new regression equation is yielded: Age = 11.2 – 0.015 Weight,

which can be rearranged in terms of weight to give:

Weight = 724.5 – 64.9 Age

These two equations give entirely different predictions for weight change with age, and only the

first one is correct. The second equation illustrates the kind of error that you might get if you

used regression without respecting the requirement always to put the response variable on the

vertical axis and the predictor variable on the horizontal axis. The first equation predicts

correctly that cubs have an average weight at birth of 420 g (when Age = 0) and an average loss

rate of 18 gday-1, whereas the second equation erroneously predicts an average birth weight 1.7

times greater, and an average rate of weight loss 3.5 times greater, than these figures.

If you are in doubt about whether one of your variables is a true predictor, then do not put a line

of best fit through the plot. Just stick to the simple correlation coefficient r for evaluating the

association between the two variables. Use r instead of r2 because the sign of r provides valuable

information about whether the variables are positively or negatively correlated with each other.

Remember, however, that the correlation coefficient does assume the two variables have a linear

relation to each other. A perfect linear relation will return a value of |r| = 1.0, but a perfect curved

relation will return a value of |r| < 1.0. If your variables are not related to each other in some

direct proportion, then you may need to transform one or other axis in order to linearize the

relation (see p. 35).


The graphs below illustrate some types of correlation (from Fowler et al. 1998 Practical

Statistics for Field Biology. Wiley). Note that the last graph, of perfect rank correlation, would

give Spearman’s rank correlation coefficient rs = 1.0, which is clearly an over-estimate of the

true level of correlation. The non-parametric Spearman’s coefficient is simply Pearson’s

coefficient calculated on the ranks. Use the parametric Pearson’s in preference to Spearman’s

wherever you can meet its assumptions.


Transforming data to meet the assumptions of parametric Analysis of Variance

Analysis of variance has proved to be a powerful and versatile technique for analysing any kind

of response variable showing some variation around a mean value. We can use ANOVA to

explain this variation in terms of two or more levels of a factor (one-way ANOVA), or in terms

of the interacting levels of two or more factors (two-way ANOVA or multi-way ANOVA), or in

terms of one or more continuous factors (simple regression or multiple regression). We can also

use ANOVA to test the evidence for a correlation between two continuous variables.

Wherever you have observations of a continuous variable that you wish to explain in terms of

one or more factors, consider using Analysis of Variance before you think of using non-

parametric statistics. Parametric tests are more powerful because they use the actual data rather

than ranks, and for many types of data there simply is no appropriate non-parametric test (e.g.

regression, two-way analyses with categorical and continuous factors, interactions etc).

Having decided to use parametric Analysis of Variance, you must be aware of its underlying

assumptions (introduced on p. 6 of this booklet). If you also know the ways in which these are

likely to be violated, then you can pre-empt many potential difficulties by applying appropriate

transformations to the data. These are the assumptions:

1. Random sampling, so that your observations are a true reflection of the population from

which you took them.

Is it a problem? This is a basic assumption of all statistical analyses, parametric or

non-parametric. Whether or not it is met depends on sampling strategy. Solution: If

your data do not meet it, then you will have to resample your data.

2. Independent observations, so that the value of one data point cannot be predicted from the

value of another.

Is it a problem? This is a basic assumption of all statistical analyses, parametric or

non-parametric, and it depends on sampling strategy. Solution: If your data do not

meet it, then either resample your data or ‘factor out’ the non-independence by

adding a new explanatory factor (e.g. add the categorical factor ‘Subject’ if you have

repeated measures on each subject).

3. Homogeneity of variance around a regression line (for a covariate), or of variances around

sample means (for a factor), because the ANOVA uses pooled error variances to seek

differences between means, and it does not seek differences between variances.

Is it a problem? Depends on the type of observations. Often violated by observations

that cannot take negative values, such as weight, length, volume, counts etc, because

these are likely to have a variance that increases with the mean. Solution: log-

transformation of response (which for regression and correlation may then require

log-transformation of x also, to reinstate linearity).

4. Normal distribution of residual variation around a regression or around sample means,

because this distribution is described by just two parameters: the mean and variance, which

are the two employed by ANOVA (a skewed distribution needs to be described with a third

parameter, not accounted for in ANOVA).

Is it a problem? Generally less than heterogeneity, and depends on the type of

observations. May be violated by observations in the form of proportions or

percentages, because they are constrained to lie between zero and 1 or 100, whereas

the normal distribution has tails out to plus and minus infinity. Also violated by

observations in the form of counts, which follow a Poisson rather than a normal

distribution. Solution: Arcsine-root transformation of proportions, or logistic

regression on proportions (which assumes binomial rather than normal errors).


Square-root transformation of counts, or use a Generalised Linear Model (the ‘glm’

command in R) which can assume Poisson errors.

5. For regression and correlation: Linear relations between continuous variables, because the

explained and residual components of variation are measured against a predicted line

defined by just two parameters, the intercept a and slope b. A non-linear relation would

need describing with additional parameters, not accounted for in the regression analysis.

Is it a problem? Depends on the type of observations. Most likely to be violated by

relationships with an inherently non-linear biology. Solution: reinstate linearity with

an appropriate transformation to one or both axes – see four examples below.

Consider fitting a polynomial only if it makes sense biologically to model the

response with additive powers of the predictor.

If any of assumptions 3-5 are not met, we should not immediately abandon the use of parametric

statistics. The command ‘glm’ will run a Generalised Linear Model that can accommodate Analysis

of Variance on data with inherently non-normal distributions, such as proportions (which have a

binomial distribution), or frequencies of rare events (with a Poisson distribution and variance

increasing with the mean response). Commands of the sort aov(Y ~ A) or lm(Y ~ A), which

we have been using up to now, have an equivalent in glm: anova(glm(Y ~ A, family =

gaussian(link = identity)), test = "F"). You can replace gaussian (for a normal

distribution) with poisson or binomial, as dictated by the type of data. This website shows a

worked example, for its model 5.9:

http://www.southampton.ac.uk/~cpd/anovas/datasets/ANOVA in R.htm

An alternative route to meeting the assumptions is by transformation of the response (commonly

with an arcsin-root transformation for proportions, or a square-root transformation for counts, or

a generic Box-Cox transformation). This is less desirable than modelling the error structure with

glm, because the transformation changes the nature of the test question.
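
For count data, for example, the two routes might look like this (a sketch with placeholder names 'Count' for the response and 'A' for a factor):

summary(aov(sqrt(Count) ~ A))                            # transformation route
anova(glm(Count ~ A, family = poisson), test = "Chisq")  # glm route with Poisson errors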

For regression analyses in particular, you may have a priori reasons for suspecting a non-linear

relationship of response to predictor. An understanding of the underlying biology will often

suggest an appropriate linearizing transformation. Transformations are not cheating, because they

are planned in advance, and the same conversion is applied to all observations. The idea is to

reduce complexity by converting a non-linear relation to a linear one. Here are some examples:

1. The response may be inherently exponential, for example in population growth over time of

freely self-replicating organisms. A linear regression on ln(population) against time will give a

slope that equals the intrinsic rate of natural increase per capita.

2. Response and predictor may have different dimensions, for example in a weight response to

length (see p. 39), suggesting a power function. Logging both axes will linearize power-function

relationships, and simultaneously deal with associated issues of the variance increasing with the

mean response and skewed residuals.

3. The response may saturate, for example in the response of weight increase to body weight, or

the response of food consumption to food abundance. Linearization is achieved by understanding

the underlying biology: try inverse body weight, and try inverse consumption and abundance.

4. The response may be cyclic, for example in a circadian rhythm. Transformation of the

predictor with a circular function (e.g., sin(x) or cos(x)) may linearize the relationship.
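
Minimal sketches of these four linearizations, with placeholder variable names:

lm(log(Population) ~ Time)                              # 1. exponential growth
lm(log(Weight) ~ log(Length))                           # 2. power function
lm(I(1/Consumption) ~ I(1/Abundance))                   # 3. saturating response
lm(Activity ~ sin(2*pi*Hour/24) + cos(2*pi*Hour/24))    # 4. circadian cycle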

If you resort to non-parametric methods, be aware that they all make assumptions 1 and 2 above.

Also, statistics on ranks (e.g., Spearman’s correlation) require that the ranks meet assumptions 3-

5. Finally, some data may not suit any statistics because they have too little variation (e.g. when

skewed by numerous zero values) or insufficient replication (e.g. data with too many missing

values). In such cases, change your test question to allow sub-sampling from the dataset.


LECTURE: FITTING STATISTICAL MODELS TO DATA

Statistical packages like R all work by fitting models to data. They require you to use an

appropriate model for the samples and variables under investigation, before they will estimate

parameter values that best fit the data. These pages will help you fit appropriate models to data.

In the first example (A1) below, the model formula is a mathematical relationship (the Poisson

probability, P(x) = e^(-x̄)·x̄^x / x!, of obtaining exactly 0, 1, 2, ... species of insects per leaf). But the other

examples all use a standard convention for presenting statistical models, which takes the form:

response variable(s) = explanatory variable(s). Here the ‘=‘ sign is simply a statement of the

hypothesised relationship between the variables rather than a logical equality. The chosen

statistic will quantify the relationship of the response variable (continuous except in A2a) to the

explanatory variables (which can be continuous: A2b & B1, or divided into samples: A3 & B2).

A. The three principal types of data and statistical models

1. One sample, one variable

For data of this kind, look for a goodness-of-fit of frequencies

E.g. The sample is 50 leaves of Sycamore picked at random; the variable is the number of species

of insect parasites per leaf. This is predicted to follow a random distribution, so the appropriate

model for calculating expected frequencies is the Poisson distribution.

x species per leaf        0      1      2      3      4+     Total
Observed frequencies      3      22     15     6      4      50
Expected frequencies      8.43   15.01  13.36  7.93   5.28   50
(O - E)²/E                3.50   3.26   0.20   0.47   0.31   7.73

(Bar chart: observed and expected frequency of leaves against number of species per leaf.)

H0: Observed distribution is no different to the expected Poisson (i.e. no interaction between species).

Test statistic: Chi-squared or G-test of goodness-of-fit

Outcome: χ²3 = 7.73, p < 0.05

Conclusion: observed numbers of species differ from random expectation. Since the

observed distribution is narrower than expected, the species are more

regularly spaced than random, with one per leaf predominating (indicating

mutual repulsion in competition between the species)

Assumptions: data are nominal (not continuous), frequencies are independent (i.e. 50 independent leaves), no cell with expected value < 5.

For continuous data, use Kolmogorov-Smirnov test.

Model formula: Poisson distribution with mean x̄ = 1.78 species/leaf
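
A minimal R sketch of this test, using the observed and expected frequencies from the table above:

obs <- c(3, 22, 15, 6, 4)
exp <- c(8.43, 15.01, 13.36, 7.93, 5.28)
X2 <- sum((obs - exp)^2 / exp)               # 7.73
pchisq(X2, df = 3, lower.tail = FALSE)       # d.f. = 5 classes - 1 - 1 estimated mean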


2. One sample, two variables

For data of this kind, look for a dependent relationship (an association) between the variables

(a) Categorical variables

Use a contingency table of frequencies to look for an interaction between the variables

E.g. Sample is 2-year old infants, variables are eye colour and behavioural dominance.

Contingency table                     Eye colour
                              Blue      Other     Total
Behaviour     Dominant        13        7         20
              Submissive      22        29        51
              Total           35        36        71

(Bar chart: frequency for each combination of eye colour and behaviour.)

H0: Column categories are independent of row categories.

Test statistic: chi-squared or G-test of independence

Outcome: χ²1 = 1.942, p = 0.16

Conclusion: there is no detectable interaction of colour with behaviour: behavioural

dominance is not associated with blue eyes

Assumptions: data are truly categorical (frequency in each cell conforms to a Poisson distribution), frequencies are independent (71 independent subjects, e.g. no siblings), no cell with expected value < 5, correction for continuity.

For cells with expected values < 5, use Fisher’s exact test.

Model formula: frequency ~ colour + behaviour + colour:behaviour
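
A minimal R sketch of this test on the contingency table above (the object name 'tab' is arbitrary):

tab <- matrix(c(13, 7, 22, 29), nrow = 2, byrow = TRUE,
              dimnames = list(Behaviour = c("Dominant", "Submissive"),
                              Eye = c("Blue", "Other")))
chisq.test(tab)     # with Yates' continuity correction (the default): X-squared = 1.94, p = 0.16
fisher.test(tab)    # exact alternative for tables with expected values < 5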


(b) continuous variables

Plot the response variable on the y-axis against the explanatory variable on the x-axis

E.g. Sample is polar bears; response variable is body weight and explanatory variable is radius

length.

Subject    Body weight (kg)    Radius length (cm)
1          65                  45.0
2          70                  47.5
3          74                  57.0
4          142                 59.5
5          121                 62.0
6          80                  53.0
7          108                 56.0
8          344                 67.5
9          371                 78.0
10         416                 72.0
11         432                 77.0
12         348                 72.0
13         476                 75.0
14         478                 75.0
:          :                   :
143        :                   :

H0: Variation in body weight is independent of radius length.

Test statistic: Linear regression on transformed weight and radius length (Ln[Weight]

labelled as a new variable ‘ln.Weight’; ln[Length] labelled ‘ln.Length’)

Outcome: F1,141 = 944.6, p < 0.0001

Conclusion: the regression slope differs from zero; radius length is a precise

predictor of body weight, explaining 87% of the variance in body weight

with the chosen model.

Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of variances, (iv) normal distribution of errors, (v) linearity.

For continuous variables with no clear functional relationship, use correlation to calculate r.

Model formula: ln.Weight ~ ln.Length


3. One-way classification of two (or more) samples

For data of this kind, look for a difference between sample means

E.g. Samples are two levels of a feeding regime for shrews: a diet of blow-fly pupae, and a diet

of dung-fly pupae. The response variable is weight (g).

Feeding regime
                       blow-fly diet (g)             dung-fly diet (g)
                       2, 2, 9, 4, 5, 5,             10, 11, 5, 13, 15, 17, 11,
                       6, 7, 8, 3, 10, 12            4, 12, 7, 8, 9, 10
n subjects =           12                            13
Mean =                 6.08                          10.15
Standard error =       0.92                          1.02

(Plot: body weight (g) by diet, blowfly vs dungfly.)

H0: Feeding regime has no effect on weight (the two samples come from the

same population)

Test statistic: Analysis of Variance (or t-test when just two groups)

Outcome: F1,23 = 8.60, p < 0.01

Conclusion: shrew body weights depend on type of feeding regime

Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of variances, (iv) normal distribution of errors.

For data with repeated measures on subjects (assumption (ii)), use repeated measures ANOVA;

for data that violate assumptions (iii) - (iv) use prior transformations, or use the non-parametric

Kruskal-Wallis test (or Mann-Whitney if you have just two samples).

Model formula: weight ~ regime
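
A minimal R sketch of this comparison, assuming a data frame 'shrews' with columns Weight and Regime:

summary(aov(Weight ~ Regime, data = shrews))              # one-way ANOVA
t.test(Weight ~ Regime, data = shrews, var.equal = TRUE)  # equivalent t-test for two groups (t² equals F)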


B. Selecting and fitting models to data

R offers many alternative commands for Analysis of Variance. The command ‘aov’ will suit

most straightforward analyses with normally distributed residuals. The command ‘glm’ will run

a Generalised Linear Model that can accommodate Analysis of Variance on data with inherently non-

normal distributions, such as proportions (which have a binomial distribution), or frequencies of

rare events (with a Poisson distribution).

1. One-way classification of two (or more) samples, two continuous variables

For data of this kind, look for differences between regression slopes

E.g. Samples are male (circles and continuous line) and female (triangles and broken line) polar

bears; response variable is body weight and explanatory variable is radius length.

Subject    Body weight (kg)    Radius length (cm)    Sex
1          65                  45.0                  M
2          70                  47.5                  F
3          74                  57.0                  F
4          142                 59.5                  M
5          121                 62.0                  F
:          :                   :                     :
143        :                   :                     :

H0: Variation in body weight is independent of radius length by sex.

Test statistic: Analysis of Variance on ln.Weight with covariate ln.Length (or Generalised Linear Model for non-normal error structures).

Outcome: ln.Length effect (adjusted for Sex) F1,139 = 1003.66, p < 0.0001

Sex effect (adjusted for ln.Length) F1,139 = 3.57, p = 0.06

Sex-by-ln.Length interaction F1,139 = 7.24, p = 0.008

Conclusion: the two regression lines have different slopes, so the effect of radius length

on weight differs by sex

Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of variances, (iv) normal distribution of errors, (v) linearity.

Model formula: ln.Weight ~ ln.Length + Sex + ln.Length:Sex
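
A minimal R sketch of this analysis, assuming a data frame 'bears' with columns Weight, Length and Sex:

# '*' expands to the two main effects plus their interaction;
# a difference in slope between the sexes shows as a significant interaction
summary(aov(log(Weight) ~ log(Length) * Sex, data = bears))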


2. Two-way classification of samples

For data of this kind, look for two-way differences between means

E.g. Shrew samples are classified by feeding regime and sex; response variable is body weight as

in Analysis of Variance above.

Feeding regime
                         blow-fly diet               dung-fly diet
Sex     females          2, 2, 9, 4, 5, 5            10, 11, 5, 13, 15, 17, 11
        males            6, 7, 8, 3, 10, 12          4, 12, 7, 8, 9, 10

(Plot: body weight (g) by diet, with separate means for females and males.)

H0: The effect of regime on weight is not affected by sex

Test statistic: Analysis of Variance (or Generalised Linear Model for non-normal error structures).

Outcome: sex effect (adjusted for regime) F1,21 = 0.01, p = 0.933

regime effect (adjusted for sex) F1,21 = 9.68, p < 0.005

regime:sex interaction effect F1,21 = 6.68, p < 0.05

Conclusion: the effect of regime on weight depends on sex, with females doing better

on dungflies and males on blowflies

Assumptions: (i) random sampling, (ii) independent errors, (iii) homogeneity of variances, (iv) normal distribution of errors.

Model formula: weight ~ regime + sex + regime:sex


PRACTICAL: CALCULATING REGRESSION AND CORRELATION

In this practical you will do ‘by hand’ the linear regression shown on pages 24-25 of this booklet.

To save tedious calculations, however, you will put Excel to work by asking it to do all of the

arithmetic for you. This still means that you will need to understand how the regression analysis

works, so refer to pages 26-27 as you follow the steps through on the computer. Look back

through the notes for lectures 3 and 4 to appreciate the underlying logic of the analysis.

First run the practical in R, using the commands on page 24 of the booklet. Then open up Excel.

On a fresh spreadsheet, type in the data shown in rows 4 to 15 of columns B and F in the Excel

worksheet illustrated on page 25 of this booklet. Don’t type in any more data than just these two

columns. Excel will do the rest! But you have to tell it what to do...

Your task is now to use Excel formulae to obtain all the figures as they appear in the other cells

and columns. Your objective is to replicate the entire sheet shown on page 25 without typing in

any more numbers. When you have done this, save the result, as you may wish to use it again.

In order to use Excel formulae, you must type an ‘=‘ sign in a cell where you wish to calculate a

number from data in other cells. For example, to obtain a value in cell B19 for the mean age, type

in cell B19:

‘=AVERAGE(B4:B15)’

Likewise, to obtain a value in cell F19 for the mean weight, type in cell F19:

‘=AVERAGE(F4:F15)’

Now to obtain a value in cell H4 for the squared deviation of the first Weight value (in cell F4)

from its sample mean (which you have just calculated in F19), type in cell H4:

‘=(F4-$F$19)^2’

Having entered this command, you can repeat it down through the whole of column H from H4

to H15 by clicking on the bottom right corner of the cell and dragging down to H15. Look at the

formulae you have created to check that they are giving you squared deviations of each weight

value from the sample mean. You should now see in column H the full set of squared deviations

of Weights from their sample mean. Now get the sum of squared deviations: SS(Total) in cell

H17 by typing ‘=SUM(H4:H15)’

Likewise, to obtain a value in cell J4 for the product of the first Weight deviation with its

corresponding Age deviation, type in J4:

‘=(B4-$B$19)*(F4-$F$19)’

Then drag that formula down to J15 in order to get all the products. Finally, get the sum of

products, SPxy, in cell J17 by typing ‘=SUM(J4:J15)’.

Do a similar operation for column D, then calculate the parameters for the slope and intercept of

the line. Use these parameter constants to obtain for each x a predicted ŷ = a + bx, in order to

then calculate the values in columns L and N. Finally calculate the explained and error SS and

MS, and the F-value. Check that your sheet matches the one on page 25. You can then ‘play’

with the data to see what difference it makes to the significance of the relationship if you change

just one of the values. For example, change the Weight value in cell F12 from 431 g to 231 g. Is

the relationship now significant? Has the magnitude of the correlation coefficient r got closer to

unity? Playing with test data in this way will help you to understand how the statistics work. But

don’t try this with real data! If you had actually observed a Weight of 431 g, then you would have

to work with that. If the outcome is a non-significant relationship, then your best explanation is

no detectable relationship (failure to reject H0), given the assumptions of the analysis.


APPENDIX 1: TERMINOLOGY OF ANALYSIS OF VARIANCE

Once you have familiarised yourself with the terminology of Analysis of Variance you will find it

easier to grasp many of the parametric techniques that you read about in statistics books. Some of

the terms described below may be referred to by one of many names, as indicated in the left hand

column. They are illustrated here with a simple example of statistical analysis, in which a biologist

wishes to explain variation in the body weights of a sample of people according to different

variables such as their height, sex and nationality. More detailed descriptions of the terms shown

below, as well as many others that go beyond your immediate needs, can be found in the Lexicon

of Statistical Modelling (http://www.geodata.soton.ac.uk/biology/lexstats.html).

Term Description

1. Variable A property that varies in a measurable way between subjects in a sample.

2. Response variable,

Dependent variable,

Y

The variable of interest, usually measured on a continuous scale (e.g.

weight: what causes variation in weight?). If these measurements are free to

vary in response to the explanatory variable(s), statistical analysis will reveal

the explanatory power of the hypothesised source(s) of variation.

3. Explanatory variable,

Independent variable,

Predictor variable,

Factor,

Effect,

X

The non-random measurements or observations (e.g. treatments of a ‘drug’

factor, fixed by experimental design), which are hypothesised in a statistical

model to have predictive power over the response variable. This hypothesis is

tested by calculating sums of squares and looking for a variation in Y between

levels of X that exceeds the variation within levels. An explanatory variable

can be categorical (e.g. sex, with 2 levels of male and female), or continuous

(e.g. height with a continuum of possibilities). The explanatory variable is

assumed to be ‘independent’ in the sense of being independent of the response

variable: i.e. weight can vary with height, but height is independent of weight.

The values of X are assumed to be measured precisely, without error,

permitting an accurate estimate of their influence on Y.

4. Variates,

Replicates,

Observations,

Scores,

Data points

The replicate observations of the response variable (Yi)

measured at each level of the explanatory variable. These are the data points,

each usually obtained from a different subject to ensure that the sample size

reflects n independent replicates (i.e. it is not inflated by non-independent

data: ‘pseudoreplication’).

5. Sample, Treatment: The collection of observations measured at a level of X (e.g. body weights from one sample of males and another of females to test the effect of Sex on Weight; or crop Yield tested with two Pesticide treatments). If X is continuous, the sample comprises all measures of Y on X (e.g. Weight on Height).

6. Sum of squares: The squared distance between each data point, Yi, and the sample mean, Ȳ, summed for all n data points. The squared deviations measure variation in a form which can be partitioned into different components that sum to give the total variation (e.g. the component of variation between samples and the component of variation within samples).

7. Variance: The variance in a normally distributed population is described by the average of n squared deviations from the mean. Variance usually refers to a sample, however, in which case it is calculated as the sum of squares divided by n - 1 rather than n. Its positive root is then the standard deviation, SD, which describes the dispersion of normally distributed variates (e.g. 95% lying within 1.96 standard deviations of the mean when n is large).
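Terms 6 and 7 correspond to one-line calculations in R. A minimal sketch with a hypothetical sample of five weights:

Y <- c(62, 70, 74, 81, 90)     # hypothetical sample of five body weights (kg)
SS <- sum((Y - mean(Y))^2)     # sum of squares: squared deviations from the sample mean
SS / (length(Y) - 1)           # sample variance, identical to var(Y)
sqrt(SS / (length(Y) - 1))     # standard deviation, identical to sd(Y)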


8. Statistical model, Y = X + ε: A statement of the hypothesised relationship in the sampled population between the response variable and the predictor variable. A simple model would be: Weight = Sex + ε. The ‘=’ does not signify a literal equality, but a statistical dependency. So the statistical analysis is going to test the hypothesis that variation in the response variable on the left of the equals sign (Weight) is explained or predicted by the factor on the right (Sex), in addition to a component of random variation (the error term ε, ‘epsilon’). An Analysis of Variance will test whether significantly more of the variation in Weight falls between the categories of ‘male’ and ‘female’, and so is explained by the independent variable ‘Sex’, than lies within each category (the random variation ε). The error term is often dropped from the model description, though it is always present in the model structure, as the random variation against which to calibrate the variation between levels of X in the F-ratio.
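In R this model statement translates directly into a model formula. A minimal sketch, using an entirely hypothetical data frame of weights and sexes:

people <- data.frame(Weight = c(82, 75, 91, 64, 58, 70),               # hypothetical weights (kg)
                     Sex    = factor(c("M", "M", "M", "F", "F", "F")))
model <- aov(Weight ~ Sex, data = people)   # Weight = Sex + error
summary(model)                              # F-test of Sex against the error mean-square
residuals(model)                            # the epsilon component: variation left unexplained by Sex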

9. Null hypothesis, H0: While a statistical model proposes a hypothesis, e.g. that Y depends on X, the statistical analysis can only seek to reject a null hypothesis: that Y does not vary with X in the population of interest. This is because it is always easier to find out how different things are than to know how much they are the same, so the statistician’s easiest objective is to establish the probability of a deviation away from random expectation rather than towards any particular alternative. Thus science in general proceeds cautiously by a process of refutation. If the analysis reveals a sufficiently small probability that the null hypothesis is true, then we can reject it and state that Y evidently depends on X in some way.

10. One-way ANOVA, Y = X: An Analysis of Variance (ANOVA) to test the model hypothesis that variation in the response variable Y can be partitioned into the different levels of a single explanatory variable X (e.g. Weight = Sex). If X is a continuous variable, then the analysis is equivalent to a linear regression, which tests for evidence of a slope in the best-fit line describing change of Y with X (e.g. Weight with Height).

11. Two-way ANOVA, Y = X1 + X2 + X1X2: Test of the hypothesis that variation in Y can be explained by one or both variables X1 and X2. If X1 and X2 are categorical and Y has been measured only once in each combination of levels of X1 and X2, then the interaction effect X1X2 cannot be estimated. Otherwise a significant interaction term means that the effect of X1 is modulated by X2 (e.g. the effect of Sex, X1, on Weight, Y, depends on Nationality, X2). If one of the explanatory variables is continuous, then the analysis is equivalent to a linear regression with one line for each level of the categorical variable (e.g. graph of Weight by Height, with one line for males and one for females): different intercepts may signify a significant effect of the categorical variable, different slopes may signify a significant interaction effect with the continuous variable.
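A sketch of both cases in R, using invented cross-factored data (the data frame, its columns and its values are all hypothetical):

set.seed(1)
people2 <- expand.grid(Sex = c("F", "M"), Nationality = c("UK", "FR"), Subject = 1:3)
people2$Height <- rnorm(nrow(people2), mean = 170, sd = 10)   # hypothetical heights (cm)
people2$Weight <- rnorm(nrow(people2), mean = 70, sd = 8)     # hypothetical weights (kg)
summary(aov(Weight ~ Sex * Nationality, data = people2))  # main effects plus the Sex:Nationality interaction
summary(lm(Weight ~ Height * Sex, data = people2))        # one line per sex; Height:Sex tests for different slopes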

12. Error, Residual: The amount by which an observed variate differs from the value predicted by the model. Errors or residuals are the segments of scores not accounted for by the analysis. In Analysis of Variance, the errors are assumed to be independent of each other, and normally distributed about the sample means. They are also assumed to be identically distributed for each sample (since the analysis is testing only for a difference between means in the sampled population), which is known as the assumption of homogeneity of variances.

13. Normal distribution: A bell-shaped frequency distribution of a continuous variable. The formula for the normal distribution contains two parameters: the mean, giving its location, and the standard deviation, giving the shape of the symmetrical ‘bell’. This distribution arises commonly in nature when myriad independent forces, themselves subject to variation, combine additively to produce a central tendency. The technique of Analysis of Variance is constructed on the assumption that the component of random variation takes a normal distribution. This is because the sums of squares that are used to describe variance in an ANOVA accurately reflect the true variation between and within samples only if the residuals are normally distributed about sample means.
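You can check the familiar 95% figure for the normal distribution directly in R:

> pnorm(1.96) - pnorm(-1.96)   # proportion within 1.96 standard deviations of the mean: about 0.95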

14. Degrees of freedom, d.f.: The number of pieces of information that we have on a response, minus the number needed to calculate its variation. The F-ratio in an Analysis of Variance is always presented with two sets of degrees of freedom, the first corresponding to one less than the a samples or levels of the explanatory variable (a - 1), and the second to the remaining error degrees of freedom (n - a). For example, a one-way ANOVA may find an effect of nationality on body weight (F3,23 = 3.10, p < 0.05) in a test of four nations (giving the 3 test degrees of freedom) sampled with 27 subjects (giving the 23 error degrees of freedom). A continuous factor has one degree of freedom, so the linear regression ANOVA has 1 and n - 2 degrees of freedom (e.g. a height effect on body weight: F1,25 = 4.27, p < 0.05, from 27 subjects).

15. F-statistic, F-ratio: The statistic calculated by Analysis of Variance, which reveals the significance of the hypothesis that Y depends on X. It comprises the ratio of two mean-squares: MS[X] / MS[ε]. The mean-square, MS, is the average sum of squares, in other words the sum of squared deviations from the mean, for X or ε (as defined above), divided by the appropriate degrees of freedom. This is why the F-ratio is always presented with two degrees of freedom, one used to create the numerator MS[X], and one the denominator, MS[ε]. The F-ratio tells us precisely how much more of the total variation in Y is explained by X (MS[X]) than is due to random, unexplained, variation (MS[ε]). A large ratio indicates a significant effect of X. In fact, the observed F-ratio is connected by a complicated equation to the exact probability of a true null hypothesis, i.e. that the ratio equals unity, but you can use standard tables to find out whether the observed F-ratio indicates <5% probability of making a mistake in rejecting a true null hypothesis.
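Instead of the standard tables (Appendix 7), R will return the exact probability for an observed F-ratio. A quick check of the F3,23 = 3.10 example used above:

> pf(3.10, df1 = 3, df2 = 23, lower.tail = FALSE)   # p-value for F[3,23] = 3.10: just under 0.05
> qf(0.95, df1 = 3, df2 = 23)                       # critical value of F at alpha = 0.05 (about 3.0)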

16. Significance, p: This is the probability of mistakenly rejecting a null hypothesis that is actually true. In the biological sciences a critical value of α = 0.05 is generally taken as marking an acceptable boundary of significance. A large F-ratio signifies a small probability that the null hypothesis is true. Thus detection of a nationality effect (F3,23 = 3.10, p < 0.05) means that the variation in weight between the samples from four nations is 3.10 times greater than the variation within samples, making these data incompatible with a null hypothesis of nationality having no effect on weight. The height effect detected in the linear regression (F1,25 = 4.27, p < 0.05) means that the distribution of data is incompatible with height having no influence on weight in the sampled population. This regression line takes the form Weight = a + b·Height, and 95% confidence intervals for the estimated slope are obtained as b ± t0.05(n-2) × SE(b); if the slope is significant, then these intervals will not encompass zero.
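In R the slope, its confidence intervals and the F-test all come from the fitted regression. A minimal sketch with hypothetical Height and Weight data (not the worked example):

hw <- data.frame(Height = c(158, 163, 170, 174, 181, 188),   # hypothetical heights (cm)
                 Weight = c(55, 61, 66, 69, 76, 84))          # hypothetical weights (kg)
fit <- lm(Weight ~ Height, data = hw)   # Weight = a + b*Height
coef(fit)                               # estimates of the intercept a and slope b
confint(fit, level = 0.95)              # 95% confidence intervals; the Height row excludes zero if the slope is significant
anova(fit)                              # F-test of the slope with 1 and n-2 degrees of freedom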


APPENDIX 2: SELF-TEST QUESTIONS ON ANALYSIS OF VARIANCE

1. Write down the formula for calculating the variance of a sample of scores (use Yi to denote a

score for each of n subjects). Explain in words what is meant by this expression.

2. Write down the formula for the standard error of the mean. Explain in words what is meant by

this expression. Why does it get smaller as n increases?

3. A sample of 8 male blackbirds is tested for response times to an alarm signal, and this is

compared to responses of a sample of 9 females. The Analysis of Variance gives a value of F =

4.56. Use tables of critical values of F to decide whether mean responses differ between males

and females. The problem could also have been answered with a t-test, in which case the test

would have produced a value of t = 2.135, which is the square root of 4.56. For both tests,

critical values are looked up in tables using the same error degrees of freedom. Look up the

critical value of t at α = 0.05 and then square it. Check that this corresponds with the equivalent

critical value of F. This shows you that an ANOVA on two samples is equivalent to a t test.
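If you prefer to check the table look-up in R, the two samples of 8 and 9 give 15 error degrees of freedom:

> qt(0.975, df = 15)^2                             # two-tailed critical t at alpha = 0.05, squared: about 4.54
> qf(0.95, df1 = 1, df2 = 15)                      # critical F with 1 and 15 d.f.: the same value
> pf(4.56, df1 = 1, df2 = 15, lower.tail = FALSE)  # exact p for the observed F = 4.56: just under 0.05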

4. State the model for the above Analysis of Variance. If we increased the sample sizes to 12 of

each sex and added a third sample of 12 neutered males, what would be the degrees of freedom

for the Analysis of Variance?

5. If we divided each of the samples into three groups, of 4 chicks, 4 juveniles, and 4 adults, we

could then test the alarm response against two independent effects: SEX and AGE. Write out

the full model and give the degrees of freedom for each term.

6. If SEX and AGE main effects were significant, but the SEX:AGE interaction was not, sketch

out how the interaction plot might look. Sketch another plot showing how it might look if the

interaction effect was also significant.

7. How would you interpret the outcome of the experiment if the interaction effect was

significant?

8. As part of your research project, you want to find out how root growth of lawn grasses is

influenced by frequency of mowing under different conditions of watering. You decide to use

urban gardens as sources of independent grass plots, purloining the services of willing

householders to provide different mowing and watering regimes. Describe how you would

design your methods so that the data could be analysed with a two-way Analysis of Variance.

(Hint: think how you want the data to look in a design matrix of the sort we have been using in

previous examples - this requires thinking through carefully!).

9. Interpret the following output from a statistics package

The regression equation is

Log(survival) = 11.4 + 11.6 Temperature

Predictor Coef StDev T P

Constant 11.417 5.309 2.15 0.047

Temperat 11.6115 0.9931 11.69 0.000

S = 2.953 R-Sq = 89.5% R-Sq(adj) = 88.9%

Analysis of Variance

Source DF SS MS F P

Regression 1 1192.1 1192.1 136.71 0.000

Residual Error 16 139.5 8.7

Total 17 1331.6


APPENDIX 3: SOURCES OF WORKED EXAMPLES IN ANALYSIS OF

VARIANCE

1. One-way Analysis of Variance

Fowler, J. & Cohen, L. 1998. Practical Statistics for Field Biology. John Wiley. Chapter 17.

Section 17.3 (p. 181)

Samuels, M.L. 1991. Statistics for the Life Sciences. Maxwell Macmillan. Chapter 12.

Example 12.1-12.9 (p. 390-406)

Exercises 12.1-12.14 (with answers at back of book)

Sokal, R.R. & Rohlf, F.J. 1995. Biometry, 3rd Edition. Freeman. Chapters 8 and 9.

Table 8.1 (p. 181) and Table 8.5

Table 8.3 (p. 192) and Table 8.6

Box 9.1 (p. 210) - unequal sample sizes

Box 9.4 (p. 218) - equal sample sizes

Zar, J.H. 1984. Biostatistical Analysis, 2nd Edition. Prentice-Hall. Chapter 11.

Example 11.1 (p. 164)

2. Two-way Analysis of Variance

Fowler, J. & Cohen, L. 1998. Practical Statistics for Field Biology. John Wiley. Chapter 17.

Section 17.6 (p. 190)

Sokal, R.R. & Rohlf, F.J. 1995. Biometry, 3rd Edition. Freeman. Chapter 11.

Box 11.1 (p. 324) - cross factored analysis

Table 11.1 (p. 327) - meaning of interaction: equivalent to Fig. 1.7 in your ANOVA notes.

Box 11.2 (p. 332)

Zar, J.H. 1984. Biostatistical Analysis, 2nd Edition. Prentice-Hall. Chapter 13.

Example 13.1 (p. 207)


APPENDIX 4: SUMMARY OF PROCEDURAL STEPS FOR ANALYSIS OF

VARIANCE

The procedural flow diagram can be summarised as:

OBSERVATIONS → PLOT → DIAGNOSTICS → assumptions met? → (NO: TRANSFORM, then re-plot) / (YES: ANALYSIS OF VARIANCE) → INTERPRETATION

DIAGNOSTICS: residuals should be random, independent, normally distributed and homogeneous in variance, and any relationship linear.

ANALYSIS OF VARIANCE: report the result as F#,# = ##.##, P < 0.0#.

INTERPRETATION: higher-order interactions first; equation and r2 for regression; Pearson's r for correlation.
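As a minimal sketch of these steps in R, using invented data for a single factor X with three levels (everything below is hypothetical, not part of the worked examples):

dat <- data.frame(X = gl(3, 6, labels = c("low", "mid", "high")),   # hypothetical factor
                  Y = rnorm(18, mean = 10, sd = 2))                 # hypothetical response
plot(Y ~ X, data = dat)              # PLOT: boxplots of Y at each level of X
model <- aov(Y ~ X, data = dat)
plot(model)                          # DIAGNOSTICS: residual and normal Q-Q plots
# If the assumptions are not met, TRANSFORM and repeat, e.g.:
# model <- aov(log(Y) ~ X, data = dat)
summary(model)                       # ANALYSIS OF VARIANCE: F-ratio and P-value
# INTERPRETATION: in multi-factor models, examine higher-order interactions first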


APPENDIX 5: SELF-TEST QUESTIONS ON ANALYSIS OF VARIANCE (2)

1. A colleague tells you he has data on the activity of three daphnia at each of six levels of

pH, and he needs advice on analysis.

a) What extra information do you need to know before you can advise on doing any

statistical tests at all?

b) If you are satisfied that statistical analysis is appropriate, are these data suitable for

Analysis of variance, and/or regression, and/or correlation? Should it be parametric

or non-parametric?

c) Significance would be tested with how many degrees of freedom?

2. You have three samples of wheat grains, one of which comes from genetically modified

parent plants, one from organic farming, and the third from conventional farming. You

want to find out if these different practices make a difference to the weight of seeds. What

are your options for analysis?

a) Regression.

b) Chi-squared test on the frequencies in different weight categories.

c) Kruskal-Wallis test on the three samples.

d) Analysis of variance on the three samples.

e) Student’s t-tests on each combination of pairs to find out how their averages differ

from each other.

3. You have a packet of wild-type tomato seeds and a packet of genetically modified tomato

seeds, and you want to know whether they give different crop yields under a conventional

growing regime and under an ‘organic’ regime. How do you find out?

4. What, if anything, is wrong with each of these reports?

a) “The data is plotted in graph 2, and it shows a significant change with temperature

(F1 = 23.71625, P = 0.000).”

b) “Figure 2 shows that temperature has a strong positive influence on activity across

this range (r2 = 0.78, F1,10 = 23.72, P < 0.001).”

c) “There is a strong negative correlation but the results are not significant (r = -0.64, P

= 0.06).”

d) “No correlation could be established from the nine observations (Pearson’s

coefficient r = -0.64, d.f. = 7, P > 0.05).”

5. Interpret the following command and output from an analysis in R: > summary(aov(Y ~ A*B))

Df Sum Sq Mean Sq F value Pr(>F)

A 2 0.61 0.30 1.393 0.3184

B 1 0.97 0.97 4.465 0.0791 .

A:B 2 136.07 68.03 312.974 8.56e-07 ***

Residuals 6 1.30 0.22

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

>


APPENDIX 6: SOURCES OF WORKED EXAMPLES ON REGRESSION

AND CORRELATION

Doncaster, C.P. & Davey, A.J.H. 2007. Analysis of Variance and Covariance: How to Choose

and Construct Models for the Life Sciences. Cambridge University Press.

- Pages 46-57.

- See the book’s web pages for:

Worked examples of all Analysis of Variance models:

http://www.southampton.ac.uk/~cpd/anovas/datasets/

Commands for analysing them in R:

http://www.southampton.ac.uk/~cpd/anovas/datasets/ANOVA in R.htm

Fowler, J. et al. 1998. Practical Statistics for Field Biology. John Wiley.

- Chapters 14-15.

- Section 14.5 (p. 135)

- Section 15.6 (p. 147)

- Sections 15.12 to 15.15 (p. 156)

Samuels, M.L. 1991. Statistics for the Life Sciences. Maxwell Macmillan.

- Chapter 13.

- Numerous examples throughout this chapter, and exercises (pp. 449, 463, 474, 484 and 493,

with answers at back of book)

Sokal, R.R. & Rohlf, F.J. 1995. Biometry, 3rd Edition. Freeman.

- Chapters 14-15.

- Table 14.1 (p. 459)

- Box 14.1 (p. 465)

Zar, J.H. 1984. Biostatistical Analysis, 2nd Edition. Prentice-Hall.

- Chapters 17, 19.

- Examples 17.1 (p. 262), and 17.9 (p. 286).

- Examples 19.1 (p. 308)

Further reference information on statistical modelling with ANOVA and regression can be found

in the Lexicon of Statistical Modelling at: http://www.geodata.soton.ac.uk/biology/lexstats.html.


APPENDIX 7: CRITICAL VALUES OF THE F-DISTRIBUTION

v1 is the degrees of freedom of the numerator mean squares;

v2 is the degrees of freedom of the denominator mean squares.

Note that the power of Analysis of Variance to detect differences can be increased if the total

number of variates is divided into more samples. For example:

(i) 2 samples with 9 variates in each, so n = 18, has critical F1,16 = 4.49

(ii) 3 samples with 6 variates in each, so n = 18, has critical F2,15 = 3.68

(iii) 6 samples with 3 variates in each, so n = 18, has critical F5,12 = 3.11

All three tests require collecting the same amount of data. The first one can only detect a

difference in the sampled population if the variance between samples is more than four times

greater than the variance within samples. The third one, in contrast, can detect a difference from a

between-sample variance little more than three times greater than the within-sample variance.
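You can reproduce these critical values in R with qf(), rather than reading them from the table:

> qf(0.95, 1, 16)   # (i)   critical F[1,16] = 4.49
> qf(0.95, 2, 15)   # (ii)  critical F[2,15] = 3.68
> qf(0.95, 5, 12)   # (iii) critical F[5,12] = 3.11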