SADC Course in Statistics Analysis of Variance with two factors (Session 13)


Learning Objectives

At the end of this session, you will be able to

• understand and interpret the components of a linear model with two categorical factors

• fit a model involving two factors, interpret the output and present the results

• understand the difference between raw means and adjusted means

• appreciate that a residual analysis is the same with more complex models


Using Paddy again!

In the paddy example, there were two categorical factors, variety and village.

Here we will look at a model including both factors and the corresponding output.

We will also discuss assumptions associated with anova models with categorical factors and procedures to check these assumptions.


A model using two factors

The objective here is to compare paddy yields across the 3 varieties and also across the 4 villages.

A linear model for this takes the form:

$y_{ij} = \beta_0 + v_i + g_j + \epsilon_{ij}$

Here $\beta_0$ represents a constant, and the $g_j$ (j = 1, 2, 3) represent the variety effect as before.

We also have the term $v_i$ (i = 1, 2, 3, 4) to represent the village effect.


Anova results

Source     d.f.    S.S.     M.S.      F     Prob.
Village      3    13.91     4.64     14.0   0.000
Variety      2    25.68    12.84     38.7   0.000
Residual    30     9.95     0.3318
Total       35    49.55

The table above is a two-way anova, since there are two factors explaining the variability in paddy yields.

Again, the Residual M.S. ($s^2$ = 0.3318) describes the variation not explained by village and variety.
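
The M.S. and F columns of the anova table can be reproduced from the d.f. and S.S. columns alone. The sketch below is only a check on the arithmetic, using the rounded figures from the table; the variable and function names are illustrative.

```python
# Sequential anova table from the slide: source -> (d.f., S.S.)
anova = {
    "Village": (3, 13.91),
    "Variety": (2, 25.68),
    "Residual": (30, 9.95),
}


def mean_square(df, ss):
    """Mean square = sum of squares divided by its degrees of freedom."""
    return ss / df


# Residual mean square (s^2), the denominator of every F ratio.
resid_ms = mean_square(*anova["Residual"])

# F ratio for each factor = factor M.S. / residual M.S.
f_ratios = {
    source: mean_square(df, ss) / resid_ms
    for source, (df, ss) in anova.items()
    if source != "Residual"
}

print(f"Residual M.S. = {resid_ms:.4f}")
for source, f in f_ratios.items():
    print(f"{source}: F = {f:.1f}")
```

Small differences in the last decimal place from the slide are expected, because the S.S. values used here are already rounded.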


Sample sizes

--------+-----------------------+-------
        |        Variety        |
Village |    New    Old   Trad  | Total
--------+-----------------------+-------
KESEN   |      0      3      4  |     7
NANDA   |      2      7      5  |    14
NIKO    |      0      2      3  |     5
SABEY   |      2      5      3  |    10
--------+-----------------------+-------
Total   |      4     17     15  |    36
--------+-----------------------+-------

The table above shows that the data are not balanced. Hence we need to worry about the order in which terms are fitted. How then should we interpret the sequential S.S.s shown in the anova on slide 5?
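
The balance check can be written out explicitly. The sketch below enters the village by variety counts from the table, recomputes the margins, and flags unequal and empty cells; the names `counts`, `is_balanced` and `empty_cells` are illustrative.

```python
# Number of plots in each village x variety cell, as in the table of counts.
counts = {
    "KESEN": {"New": 0, "Old": 3, "Trad": 4},
    "NANDA": {"New": 2, "Old": 7, "Trad": 5},
    "NIKO": {"New": 0, "Old": 2, "Trad": 3},
    "SABEY": {"New": 2, "Old": 5, "Trad": 3},
}

# Row (village) totals.
row_totals = {village: sum(row.values()) for village, row in counts.items()}

# Column (variety) totals.
col_totals = {}
for row in counts.values():
    for variety, n in row.items():
        col_totals[variety] = col_totals.get(variety, 0) + n

# A design is balanced only if every cell has the same number of observations.
cell_sizes = {n for row in counts.values() for n in row.values()}
is_balanced = len(cell_sizes) == 1

# Cells with no data at all.
empty_cells = [
    (village, variety)
    for village, row in counts.items()
    for variety, n in row.items()
    if n == 0
]

print("Row totals:", row_totals)
print("Column totals:", col_totals)
print("Balanced:", is_balanced)
print("Empty cells:", empty_cells)
```

Two cells are empty (no plots of the new variety in Kesen or Niko), so the village and variety effects cannot be separated cleanly and the order of fitting matters.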


Anova with adjusted SS and MS

Source     d.f.   Adj.S.S.   Adj.M.S.     F     Prob.
Village      3      4.32       1.44      4.34   0.012
Variety      2     25.68      12.84      38.7   0.000
Residual    30      9.95       0.3318
Total       35     49.55

How may the above results be interpreted?

What are your conclusions?
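
One way to relate the sequential and adjusted tables is through the sums of squares themselves. The sketch below assumes the first anova was sequential with village fitted before variety, so its variety line is already adjusted for village. The model S.S. is the same whichever factor is fitted first, which lets us back out the S.S. for variety when village is ignored. All figures are the rounded values from the two tables.

```python
# Sequential S.S. (village fitted first, then variety).
ss_village_first = 13.91          # village, ignoring variety
ss_variety_after_village = 25.68  # variety, adjusted for village

# Adjusted S.S. for village (village fitted after variety).
ss_village_after_variety = 4.32

# The total S.S. explained by the two factors does not depend on the order.
model_ss = ss_village_first + ss_variety_after_village

# So the S.S. for variety fitted first (ignoring village) must be:
ss_variety_first = model_ss - ss_village_after_variety

print(f"Model S.S.                 = {model_ss:.2f}")
print(f"Variety, ignoring village  = {ss_variety_first:.2f}")
print(f"Village, ignoring variety  = {ss_village_first:.2f}")
print(f"Village, adjusted          = {ss_village_after_variety:.2f}")
```

The village S.S. falls from 13.91 to 4.32 once variety is allowed for, which suggests that much of the apparent village difference reflects which varieties were grown in which village.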


Model estimates

Parameter             Coeff.   Std.error      t     t prob
$\beta_0$ (constant)   5.284     0.386      13.7    0.000
v2 (Nanda)             0.718     0.272      2.63    0.013
v3 (Niko)             -0.179     0.337     -0.53    0.599
v4 (Sabey)             0.633     0.294      2.16    0.039
g2 (old)              -1.201     0.327     -3.67    0.001
g3 (trad)             -2.614     0.340     -7.68    0.000

What do these results tell us?
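
To see what the estimates imply, they can be combined into a fitted mean yield for every village and variety. The sketch below assumes Kesen and the new variety are the base levels with effect zero, since they are the levels missing from the table of estimates; the dictionary names are illustrative.

```python
# Estimates from the table; base levels (assumed: Kesen, New) have effect 0.
constant = 5.284
village_effect = {"Kesen": 0.0, "Nanda": 0.718, "Niko": -0.179, "Sabey": 0.633}
variety_effect = {"New": 0.0, "Old": -1.201, "Trad": -2.614}

# Fitted mean for each cell = constant + village effect + variety effect.
fitted = {
    (village, variety): constant + v + g
    for village, v in village_effect.items()
    for variety, g in variety_effect.items()
}

for (village, variety), value in fitted.items():
    print(f"{village:6s} {variety:5s} {value:6.3f}")
```

Because the model is additive, a fitted value is produced even for the two cells with no data (the new variety in Kesen and Niko); those values rest entirely on the assumption that the variety differences are the same in every village.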


Relating estimates to means

Again: Old - New = -1.201 = Estimate of g2

Trad - New = -2.614 = Estimate of g3

This is similar to the case with one categorical factor: comparisons with the “base” level can be made easily using the model estimates.

But when sample sizes are unequal across the two categorical factors, results should be reported in terms of adjusted means!


Raw means and adjusted means

Raw means:

Variety         Sample size (n)   Raw mean   Std.error (s.d./√n)
New improved           4            5.96          0.128
Old improved          17            4.54          0.173
Traditional           15            3.00          0.168

Model based summaries (adjusted means):

Variety         Adjusted mean   Std.error (s/√n)
New improved        5.58             0.308
Old improved        4.38             0.148
Traditional         2.96             0.150


Computing adjusted means

The model equation

$y_{ij} = \beta_0 + v_i + g_j + \epsilon_{ij}$

can be used to find the variety adjusted means,

e.g. the adjusted mean for the traditional variety is:

$\hat{\beta}_0 + \frac{1}{4}(\hat{v}_1 + \hat{v}_2 + \hat{v}_3 + \hat{v}_4) + \hat{g}_3$

= 5.284 + 0.25[0 + 0.718 - 0.179 + 0.633] - 2.614

= 2.963

Thus the variety adjusted mean is an average over the 4 villages.
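
The same calculation can be repeated for all three varieties. The sketch below averages the four village effects with equal weight and adds each variety effect in turn, assuming Kesen and the new variety are the base levels with effect zero; the names are illustrative.

```python
# Estimates from the model; base levels (assumed: Kesen, New) have effect 0.
constant = 5.284
village_effect = {"Kesen": 0.0, "Nanda": 0.718, "Niko": -0.179, "Sabey": 0.633}
variety_effect = {"New": 0.0, "Old": -1.201, "Trad": -2.614}

# Equal-weight average of the village effects (each village counts 1/4).
mean_village_effect = sum(village_effect.values()) / len(village_effect)

# Adjusted mean = constant + average village effect + variety effect.
adjusted_mean = {
    variety: constant + mean_village_effect + g
    for variety, g in variety_effect.items()
}

for variety, value in adjusted_mean.items():
    print(f"{variety:5s} {value:.3f}")
```

The results agree with the adjusted means of 5.58, 4.38 and 2.96 reported earlier. Each village gets the same weight regardless of how many plots it contributed, which is what distinguishes an adjusted mean from a raw mean.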


Checking model assumptions

The anova model with two categorical factors is:

$y_{ij} = \beta_0 + v_i + g_j + \epsilon_{ij}$

Model assumptions are associated with the $\epsilon_{ij}$.

These are checked in exactly the same way as before.

A residual analysis is done, looking at plots of residuals in various ways.

We give below a residual analysis for the model fitted above.


Histogram to check normality

Histogram of standardised residuals after fitting a model of yield on village and variety.

[Figure: histogram of standardised residuals. x-axis: Standardized residuals (-2 to 2); y-axis: Density (0 to 0.5).]


A normal probability plot…

Another check on the normality assumption

Do you think the points follow a straight line?

[Figure: normal probability plot. x-axis: Inverse Normal (-2 to 2); y-axis: Standardized residuals (-2 to 2).]
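
For readers who want to see how such a plot is built, the sketch below pairs each sorted residual with the standard normal quantile expected at its rank. The residuals are hypothetical values made up for illustration, not the residuals from the paddy model, and the plotting position i/(n+1) is one common convention among several; the package used for the slides may use a different one.

```python
from statistics import NormalDist

# Hypothetical standardised residuals, for illustration only;
# these are NOT the residuals from the paddy model.
residuals = [-1.8, -1.1, -0.7, -0.3, -0.1, 0.2, 0.4, 0.9, 1.3, 1.7]


def normal_plot_points(values):
    """Pair each sorted residual with its expected standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, s.d. 1
    # Plotting position i/(n+1): an assumed convention, others exist.
    quantiles = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


points = normal_plot_points(residuals)
for quantile, residual in points:
    print(f"{quantile:6.2f}  {residual:6.2f}")
```

If the residuals are roughly normal, the printed pairs lie close to a straight line when the second column is plotted against the first.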


Std. residuals versus fitted values

Checking assumption of variance homogeneity, and identification of outliers:

What can you say here about the variance homogeneity assumption?

[Figure: standardized residuals plotted against fitted values. x-axis: Fitted values (2 to 6); y-axis: Standardized residuals (-2 to 2).]


Finally… know your software

Different software packages impose different constraints on model parameters, so you need to be aware of which constraint your package uses.

For example, Stata and Genstat set the first level of the factor to zero. SPSS and SAS set the last level to zero. Minitab imposes a constraint that sets the sum of the parameter estimates to zero!
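
The effect of the constraint can be illustrated with the variety estimates from this session. The sketch below starts from the first-level-zero form (new variety as base) and rewrites the same fit with the last level set to zero and with the effects summing to zero. It is an illustration of the arithmetic only, not the output of any of the packages named; the function `reparametrise` is hypothetical, and the village effects are left in their first-level-zero form throughout.

```python
# First-level-zero form, as in the table of estimates (New is the base).
constant = 5.284
variety_effect = {"New": 0.0, "Old": -1.201, "Trad": -2.614}


def reparametrise(constant, effects, shift):
    """Move `shift` from the effects into the constant; fitted values are unchanged."""
    new_effects = {level: value - shift for level, value in effects.items()}
    return constant + shift, new_effects


# Last level set to zero: shift by the effect of the last level (Trad).
const_last, effect_last = reparametrise(
    constant, variety_effect, variety_effect["Trad"]
)

# Effects sum to zero: shift by the mean of the effects.
mean_effect = sum(variety_effect.values()) / len(variety_effect)
const_sum, effect_sum = reparametrise(constant, variety_effect, mean_effect)

for label, c, e in [
    ("first level zero", constant, variety_effect),
    ("last level zero", const_last, effect_last),
    ("sum to zero", const_sum, effect_sum),
]:
    rounded = {level: round(value, 3) for level, value in e.items()}
    print(f"{label:17s} constant = {c:.3f}  effects = {rounded}")
```

The individual estimates and the constant change with the constraint, but differences between levels (for example Old - New = -1.201) are the same in all three forms, which is why comparisons and adjusted means do not depend on the package.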

Check also whether the software produces sequential, adjusted or some other form of sums of squares. The correct interpretation of the anova results depends on this.


Practical work follows to ensure learning objectives are achieved…
