SADC Course in Statistics Analysis of Variance with two factors (Session 13)

Learning Objectives

At the end of this session, you will be able to

• understand and interpret the components of a linear model with two categorical factors

• fit a model involving two factors, interpret the output and present the results

• understand the difference between raw means and adjusted means

• appreciate that a residual analysis is the same with more complex models

Using Paddy again!

In the paddy example, there were two categorical factors, variety and village.

Here we will look at a model including both factors and the corresponding output.

We will also discuss assumptions associated with anova models with categorical factors and procedures to check these assumptions.

A model using two factors

Objective here is to compare paddy yields across the 3 varieties and also across villages.

A linear model for this takes the form:

$y_{ij} = \beta_0 + v_i + g_j + \varepsilon_{ij}$

Here $\beta_0$ represents a constant, and the $g_j$ (j=1,2,3) represent the variety effect as before.

We also have the term $v_i$ (i=1,2,3,4) to represent the village effect.

Anova results

Source     d.f.    S.S.     M.S.      F     Prob.
Village      3    13.91     4.64    14.0    0.000
Variety      2    25.68    12.84    38.7    0.000
Residual    30     9.95     0.3318
Total       35    49.55

Above is a two-way anova since there are two factors explaining the variability in paddy yields.

Again the Residual M.S. ($s^2$) = 0.3318 describes the variation not explained by village and variety.
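
The arithmetic behind the table can be checked with a minimal sketch. It assumes only the d.f. and S.S. values printed above; the dictionary layout and the function name `mean_square` are illustrative choices, not part of the course material.

```python
# d.f. and S.S. copied from the anova table above.
anova = {
    "Village": {"df": 3, "ss": 13.91},
    "Variety": {"df": 2, "ss": 25.68},
    "Residual": {"df": 30, "ss": 9.95},
}


def mean_square(row):
    """Mean square = sum of squares divided by its degrees of freedom."""
    return row["ss"] / row["df"]


# The residual mean square is the estimate of the unexplained variance (s^2).
residual_ms = mean_square(anova["Residual"])

# Each F ratio compares a factor's mean square with the residual mean square.
f_ratios = {
    name: mean_square(anova[name]) / residual_ms
    for name in ("Village", "Variety")
}

for name, f in f_ratios.items():
    print(f"{name}: M.S. = {mean_square(anova[name]):.2f}, F = {f:.1f}")
print(f"Residual M.S. = {residual_ms:.4f}")
```

Small differences in the last decimal place can arise because the table entries are themselves rounded.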

Sample sizes

--------+-----------------------+-------
        |        Variety        |
Village |   New    Old    Trad  | Total
--------+-----------------------+-------
KESEN   |    0      3      4    |    7
NANDA   |    2      7      5    |   14
NIKO    |    0      2      3    |    5
SABEY   |    2      5      3    |   10
--------+-----------------------+-------
Total   |    4     17     15    |   36
--------+-----------------------+-------

The table above shows that the data are not balanced. Hence we need to worry about the order of fitting terms. How then should we interpret the sequential S.S.'s shown in the earlier anova results?
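
A quick way to see the imbalance is to tabulate the cell counts and check whether they are all equal. The sketch below uses the counts from the table; the variable names are illustrative.

```python
# Number of plots in each village x variety cell, copied from the table.
counts = {
    "KESEN": {"New": 0, "Old": 3, "Trad": 4},
    "NANDA": {"New": 2, "Old": 7, "Trad": 5},
    "NIKO": {"New": 0, "Old": 2, "Trad": 3},
    "SABEY": {"New": 2, "Old": 5, "Trad": 3},
}

# Margins of the table.
village_totals = {v: sum(row.values()) for v, row in counts.items()}
variety_totals = {
    g: sum(row[g] for row in counts.values()) for g in ("New", "Old", "Trad")
}

# A design is balanced when every cell holds the same number of observations.
cell_sizes = {n for row in counts.values() for n in row.values()}
is_balanced = len(cell_sizes) == 1

# Cells with no observations at all.
empty_cells = [
    (v, g) for v, row in counts.items() for g, n in row.items() if n == 0
]

print("Village totals:", village_totals)
print("Variety totals:", variety_totals)
print("Balanced:", is_balanced)
print("Empty cells:", empty_cells)
```

Here two cells are empty (the new variety was not grown in Kesen or Niko), so village and variety comparisons are partly entangled.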

Anova with adjusted SS and MS

Source     d.f.   Adj.S.S.   Adj.M.S.     F     Prob.
Village      3      4.32       1.44     4.34    0.012
Variety      2     25.68      12.84     38.7    0.000
Residual    30      9.95       0.3318
Total       35     49.55

How may the above results be interpreted?

What are your conclusions?
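
One way to relate the two anova tables is sketched below. It assumes the sequential table fitted village first and then variety, and that each adjusted S.S. is computed after fitting the other factor; under those assumptions the model S.S. is the same in both orders, so the S.S. for variety fitted on its own can be recovered by subtraction. The variable names are illustrative.

```python
# Values copied from the sequential and adjusted anova tables.
seq_village = 13.91            # village fitted first, ignoring variety
variety_given_village = 25.68  # variety after allowing for village
adj_village = 4.32             # village after allowing for variety
residual_ss = 9.95
residual_df = 30

# Total S.S. explained by the two-factor model (same whatever the order).
model_ss = seq_village + variety_given_village

# Implied S.S. for variety if it were fitted first, ignoring village.
variety_alone = model_ss - adj_village

# F ratio for village after adjusting for variety.
adj_village_f = (adj_village / 3) / (residual_ss / residual_df)

print(f"Model S.S. = {model_ss:.2f}")
print(f"Variety fitted first = {variety_alone:.2f}")
print(f"Adjusted F for village = {adj_village_f:.2f}")
```

The drop in the village S.S. from 13.91 to 4.32 suggests that much of the apparent village-to-village variation is shared with variety, because the varieties are unevenly spread across villages.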

Model estimates

Parameter              Coeff.   Std.error      t     t prob
$\beta_0$ : constant    5.284     0.386      13.7    0.000
v2 (Nanda)              0.718     0.272       2.63   0.013
v3 (Niko)              -0.179     0.337      -0.53   0.599
v4 (Sabey)              0.633     0.294       2.16   0.039
g2 (old)               -1.201     0.327      -3.67   0.001
g3 (trad)              -2.614     0.340      -7.68   0.000

What do these results tell us?
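
As a hedged illustration of how the estimates combine, the sketch below computes fitted yields for a village and variety. It assumes Kesen and the new variety are the base levels (their effects are set to zero), which is inferred from the levels missing from the table; the function name `fitted_yield` is hypothetical.

```python
# Estimates copied from the table; base levels have effect zero (assumed:
# Kesen for village, New for variety, since neither appears in the table).
CONSTANT = 5.284
VILLAGE = {"Kesen": 0.0, "Nanda": 0.718, "Niko": -0.179, "Sabey": 0.633}
VARIETY = {"New": 0.0, "Old": -1.201, "Trad": -2.614}


def fitted_yield(village, variety):
    """Fitted value = constant + village effect + variety effect."""
    return CONSTANT + VILLAGE[village] + VARIETY[variety]


for village, variety in [("Kesen", "New"), ("Nanda", "Old"), ("Niko", "Trad")]:
    print(f"{village}, {variety}: {fitted_yield(village, variety):.3f}")
```

Under this reading the constant is the fitted yield for the new variety in Kesen, a combination with no observations in the sample, so it is a model-based prediction rather than an observed mean.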

Relating estimates to means

Again: Old - New = -1.201 = Estimate of g2

Trad - New = -2.614 = Estimate of g3

This is similar to the case with one categorical factor – can make comparisons easily with the “base” level using model estimates.

But when sample sizes are unequal across the two categorical factors, results should be reported in terms of adjusted means!

Raw means and adjusted means

               Sample     Raw      Std.error
Variety        Size(n)    Means    (s.d./√n)
New improved      4       5.96       0.128
Old improved     17       4.54       0.173
Traditional      15       3.00       0.168

Model based summaries (adjusted means):

Variety        Adjusted means    Std.error (s/√n)
New improved        5.58              0.308
Old improved        4.38              0.148
Traditional         2.96              0.150

Computing adjusted means

The model equation

$y_{ij} = \beta_0 + v_i + g_j + \varepsilon_{ij}$

can be used to find the variety adjusted means

e.g. adjusted mean for traditional variety is:

$\hat{\beta}_0 + \dfrac{\hat{v}_1 + \hat{v}_2 + \hat{v}_3 + \hat{v}_4}{4} + \hat{g}_3$

= 5.284+0.25[0+0.718–0.179+0.633]–2.614

= 2.963

Thus the variety adjusted mean is an average over the 4 villages.
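
The same calculation can be repeated for all three varieties. This is a minimal sketch using the model estimates, assuming the first village and the new variety are the base levels with effect zero; the variable names are illustrative.

```python
# Model estimates; base levels (first village, new variety) are zero.
constant = 5.284
village_effects = [0.0, 0.718, -0.179, 0.633]
variety_effects = {
    "New improved": 0.0,
    "Old improved": -1.201,
    "Traditional": -2.614,
}

# Unweighted average of the village effects: each village counts equally,
# regardless of how many plots were sampled there.
mean_village = sum(village_effects) / len(village_effects)

adjusted_means = {
    name: constant + mean_village + effect
    for name, effect in variety_effects.items()
}

for name, value in adjusted_means.items():
    print(f"{name}: {value:.3f}")
```

Rounded to two decimals, these agree with the adjusted means 5.58, 4.38 and 2.96 reported earlier.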

Checking model assumptions

Anova model with two categorical factors is:

$y_{ij} = \beta_0 + v_i + g_j + \varepsilon_{ij}$

Model assumptions are associated with the $\varepsilon_{ij}$.

These are checked in exactly the same way as before.

A residual analysis is done, looking at plots of residuals in various ways.

We give below a residual analysis for the model fitted above.
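
For orientation, a simplified sketch of how standardised residuals could be computed is given here. It uses the crude definition "residual divided by s", whereas statistical packages usually also adjust for leverage, so their values will differ somewhat. The observed yields are invented purely for illustration and are not the paddy data; the cutoff of 2 is a common rule of thumb, not a value taken from the session.

```python
import math

RESIDUAL_MS = 0.3318          # s^2 from the anova table
S = math.sqrt(RESIDUAL_MS)    # residual standard deviation


def standardised_residuals(observed, fitted, s=S):
    """Crude standardisation: (observed - fitted) / s, ignoring leverage."""
    return [(y - f) / s for y, f in zip(observed, fitted)]


def flag_possible_outliers(std_resid, cutoff=2.0):
    """Indices of residuals larger in size than the (assumed) cutoff."""
    return [i for i, r in enumerate(std_resid) if abs(r) > cutoff]


# Hypothetical observations, invented for illustration only.
observed = [5.1, 4.2, 3.9, 3.0]
# Fitted values from the model estimates (constant + village + variety).
fitted = [4.801, 4.801, 2.491, 3.303]

std_resid = standardised_residuals(observed, fitted)
flagged = flag_possible_outliers(std_resid)

print([round(r, 2) for r in std_resid])
print("Possible outliers at positions:", flagged)
```

In practice these values would then be plotted (histogram, normal probability plot, residuals against fitted values) rather than inspected as a list.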

Histogram to check normality

Histogram of standardised residuals after fitting a model of yield on village and variety.

[Figure: histogram. x-axis: Standardized residuals (-2 to 2); y-axis: Density (0 to 0.5)]

A normal probability plot…

Another check on the normality assumption

Do you think the points follow a straight line?

[Figure: normal probability plot. x-axis: Inverse Normal (-2 to 2); y-axis: Standardized residuals (-2 to 2)]

Std. residuals versus fitted values

Checking assumption of variance homogeneity, and identification of outliers:

What can you say here about the variance homogeneity assumption?

[Figure: scatter plot. x-axis: Fitted values (2 to 6); y-axis: Standardized residuals (-2 to 2)]

Finally… know your software

Different software packages impose different constraints on model parameters, so you need to be aware of which constraint your package uses.

For example, Stata and Genstat set the first level of the factor to zero. SPSS and SAS set the last level to zero. Minitab imposes a constraint that sets the sum of the parameter estimates to zero!

Check also whether the software produces sequential or adjusted or some other form of sums of squares. The correct interpretation of anova results would depend on this.
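
To illustrate why the constraint matters for reading the output but not for the fitted model, the sketch below converts the "first level set to zero" estimates into "effects sum to zero" estimates. The function `to_sum_to_zero` is a hypothetical helper written for this example, not a routine from any of the packages named above, and the base levels (Kesen, New) are assumed as before.

```python
CONSTANT = 5.284
VILLAGE = {"Kesen": 0.0, "Nanda": 0.718, "Niko": -0.179, "Sabey": 0.633}
VARIETY = {"New": 0.0, "Old": -1.201, "Trad": -2.614}


def to_sum_to_zero(constant, *factors):
    """Re-express 'first level = 0' effects so each factor sums to zero.

    The mean of each factor's effects is moved into the constant, which
    leaves every fitted value unchanged.
    """
    new_constant = constant
    new_factors = []
    for effects in factors:
        mean = sum(effects.values()) / len(effects)
        new_constant += mean
        new_factors.append({lvl: e - mean for lvl, e in effects.items()})
    return new_constant, new_factors


new_constant, (village_s, variety_s) = to_sum_to_zero(CONSTANT, VILLAGE, VARIETY)

# Fitted value for one cell under each parameterisation.
old_fit = CONSTANT + VILLAGE["Nanda"] + VARIETY["Old"]
new_fit = new_constant + village_s["Nanda"] + variety_s["Old"]

print(f"Constant: {CONSTANT:.3f} -> {new_constant:.3f}")
print({k: round(v, 3) for k, v in variety_s.items()})
print(f"Fitted (Nanda, Old): {old_fit:.3f} vs {new_fit:.3f}")
```

The individual coefficients change, but differences between levels (for example Old minus New) and all fitted values stay the same, which is why comparisons should be read with the package's constraint in mind.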

Practical work follows to ensure learning objectives are achieved…