
ANOVA Test


Analysis of variance

In statistics, analysis of variance (ANOVA) is a collection of statistical models, and their associated procedures, in which the observed variance is partitioned into components due to different explanatory variables. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes Student's two-sample t-test to more than two groups. ANOVAs are useful because running multiple two-sample t-tests instead would greatly increase the chance of committing a Type I error; for this reason, ANOVAs are preferred when comparing three or more means.
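As a minimal sketch of this use (hypothetical data; the group names, means, and sample sizes are made up), a single one-way ANOVA in Python replaces three pairwise t-tests:

```python
# A minimal sketch (hypothetical data): a single one-way ANOVA across three
# groups, instead of three pairwise two-sample t-tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(5.0, 1.0, size=20)   # made-up samples for illustration
group_b = rng.normal(5.5, 1.0, size=20)
group_c = rng.normal(6.0, 1.0, size=20)

# One F-test keeps the overall Type I error at the nominal alpha level,
# unlike three separate two-sample t-tests.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```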

Overview

There are three conceptual classes of such models:

1. Fixed-effects models assume that the data came from normal populations which may

differ only in their means. (Model 1)

2. Random effects models assume that the data describe a hierarchy of different populations

whose differences are constrained by the hierarchy. (Model 2)

3. Mixed-effect models describe situations where both fixed and random effects are present.

(Model 3)

In practice, there are several types of ANOVA depending on the number of treatments and the

way they are applied to the subjects in the experiment:

One-way ANOVA is used to test for differences among two or more independent groups.

Typically, however, the one-way ANOVA is used to test for differences among at least three

groups, since the two-group case can be covered by a t-test (Gosset, 1908). When there are only

two means to compare, the t-test and the F-test are equivalent; the relation between ANOVA and t is given by F = t².
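This equivalence is easy to check numerically; a small sketch with made-up samples (the group locations and sizes are arbitrary):

```python
# Sketch verifying F = t² in the two-group case (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 1.0, size=30)

t_stat, _ = stats.ttest_ind(x, y)   # pooled-variance two-sample t-test
f_stat, _ = stats.f_oneway(x, y)    # one-way ANOVA on the same two groups
print(t_stat ** 2, f_stat)          # the two values agree: F = t²
```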

Factorial ANOVA is used when the experimenter wants to study the effects of two or

more treatment variables. The most commonly used type of factorial ANOVA is the 2² (read "two by two") design, where there are two independent variables and each variable has two levels or distinct values. However, such use of ANOVA for analysis of 2ᵏ factorial designs and fractional factorial designs is "confusing and makes little sense"; instead it is suggested to refer the value of the effect divided by its standard error to a t-table.[1] Factorial ANOVA can also be multi-level, such as 3³, or of higher order, such as 2×2×2, but analyses with higher numbers

of factors are rarely done by hand because the calculations are lengthy. However, since the

introduction of data analytic software, the utilization of higher order designs and analyses has

become quite common.
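As a sketch of such a software-based analysis, a 2×2 factorial ANOVA can be fit with statsmodels; the data frame, column names, and effect sizes below are hypothetical:

```python
# Sketch of a 2×2 factorial ANOVA with statsmodels (hypothetical data and
# column names); requires numpy, pandas, and statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "a": np.repeat(["low", "high"], 20),
    "b": np.tile(np.repeat(["off", "on"], 10), 2),
})
df["y"] = rng.normal(0.0, 1.0, size=40) + (df["a"] == "high") * 0.8

# Main effects of a and b plus their interaction (the a:b term).
model = ols("y ~ C(a) * C(b)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```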

Repeated measures ANOVA is used when the same subjects are used for each treatment

(e.g., in a longitudinal study). Note that such within-subjects designs can be subject to carry-over

effects.
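A sketch of a one-way repeated measures ANOVA using statsmodels' AnovaRM; the subject/condition layout and the effect below are hypothetical:

```python
# Sketch of a repeated measures ANOVA (hypothetical data): each subject is
# measured under every condition, as in a longitudinal design.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(3)
subjects = np.repeat(np.arange(10), 3)            # 10 subjects, 3 conditions
conditions = np.tile(["t1", "t2", "t3"], 10)
scores = rng.normal(0.0, 1.0, size=30) + (conditions == "t3") * 0.5

df = pd.DataFrame({"subject": subjects, "condition": conditions, "score": scores})
print(AnovaRM(df, depvar="score", subject="subject", within=["condition"]).fit())
```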


Mixed-design ANOVA:

When one wishes to test two or more independent groups while also subjecting the subjects to repeated measures, one may perform a factorial mixed-design ANOVA, in which one factor is a between-subjects variable and the other is a within-subjects variable. This is a type of mixed-effect

model.

Multivariate analysis of variance (MANOVA) is used when there is more than one dependent

variable.

MODELS:

Fixed-effects models (Model 1)

The fixed-effects model of analysis of variance applies to situations in which the

experimenter applies several treatments to the subjects of the experiment to see if the response

variable values change. This allows the experimenter to estimate the ranges of response variable

values that the treatment would generate in the population as a whole.

Random-effects models (Model 2)

Random effects models are used when the treatments are not fixed. This occurs when the

various treatments (also known as factor levels) are sampled from a larger population. Because

the treatments themselves are random variables, some assumptions and the method of contrasting

the treatments differ from ANOVA model 1.

Most random-effects or mixed-effects models are not concerned with making inferences

concerning the particular sampled factors. For example, consider a large manufacturing plant in

which many machines produce the same product. The statistician studying this plant would have

very little interest in comparing the particular machines sampled to each other. Rather, inferences

that can be made for all machines are of interest, such as their variability and the mean.

Assumptions of ANOVA

There are several approaches to the analysis of variance.

A model often presented in textbooks

Many textbooks present the analysis of variance in terms of a linear model, which makes the

following assumptions:

Independence of cases – this is an assumption of the model that simplifies the statistical analysis.

Normality – the distributions of the residuals are normal.


Equality (or "homogeneity") of variances, called homoscedasticity — the variance of data

in groups should be the same. Model-based approaches usually assume that the variance is

constant. The constant-variance property also appears in the randomization (design-based)

analysis of randomized experiments, where it is a necessary consequence of the randomized

design and the assumption of unit treatment additivity (Hinkelmann and Kempthorne): If the

responses of a randomized balanced experiment fail to have constant variance, then the

assumption of unit treatment additivity is necessarily violated.

Levene's test for homogeneity of variances is typically used to examine the plausibility

of homoscedasticity. The Kolmogorov–Smirnov or the Shapiro–Wilk test may be used to

examine normality.
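Both checks are available in scipy; a minimal sketch on made-up groups:

```python
# Sketch of the assumption checks named above (hypothetical groups):
# Levene's test for homoscedasticity, Shapiro-Wilk for normality of residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = [rng.normal(m, 1.0, 25) for m in (0.0, 0.3, 0.6)]

print(stats.levene(*groups))                        # equality of variances
residuals = np.concatenate([g - g.mean() for g in groups])
print(stats.shapiro(residuals))                     # normality of residuals
```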

When used in the analysis of variance to test the hypothesis that all treatments have

exactly the same effect, the F-test is robust (Ferguson & Takane, 2005, pp. 261–2). The Kruskal–

Wallis test is a nonparametric alternative which does not rely on an assumption of normality.

The Friedman test is the nonparametric alternative for a one-way repeated measures

ANOVA.
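Both nonparametric alternatives are also in scipy; a minimal sketch (the group data are hypothetical, and for the Friedman test each argument is read as one repeated measurement on the same subjects):

```python
# Sketch of the nonparametric alternatives (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a, b, c = (rng.normal(m, 1.0, 20) for m in (0.0, 0.4, 0.8))

print(stats.kruskal(a, b, c))             # alternative to one-way ANOVA
print(stats.friedmanchisquare(a, b, c))   # alternative to repeated measures ANOVA
```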

The separate assumptions of the textbook model imply that the errors are independently, identically, and normally distributed for fixed-effects models; that is, the errors are independent and ε ~ N(0, σ²).

Randomization-based analysis:

In a randomized controlled experiment, the treatments are randomly assigned to

experimental units, following the experimental protocol. This randomization is objective and

declared before the experiment is carried out. The objective random-assignment is used to test

the significance of the null hypothesis, following the ideas of C. S. Peirce and Ronald A. Fisher.

This design-based analysis was discussed and developed by Francis J. Anscombe at Rothamsted

Experimental Station and by Oscar Kempthorne at Iowa State University. Kempthorne and his

students make an assumption of unit treatment additivity, which is discussed in the books of

Kempthorne and David R. Cox.

Unit-treatment additivity:

In its simplest form, the assumption of unit-treatment additivity states that the observed response y_{i,j} from experimental unit i when receiving treatment j can be written as the sum of the unit's response y_i and the treatment effect t_j, that is,

y_{i,j} = y_i + t_j.[4]

The assumption of unit-treatment additivity implies that, for every treatment j, the jth treatment has exactly the same effect t_j on every experimental unit.
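A small numeric sketch (made-up unit responses and treatment effects) of the key consequence used below, that additivity forces constant variance across treatments:

```python
# Sketch: under unit-treatment additivity, y_ij = y_i + t_j, so adding a
# fixed treatment effect shifts responses without changing their variance.
import numpy as np

rng = np.random.default_rng(6)
unit_responses = rng.normal(10.0, 2.0, size=100)   # y_i for 100 units
for t_j in (0.0, 1.5, 3.0):                        # three treatment effects
    y_ij = unit_responses + t_j
    print(t_j, round(y_ij.var(ddof=1), 3))         # same variance every time
```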


The assumption of unit treatment additivity usually cannot be directly falsified, according

to Cox and Kempthorne. However, many consequences of treatment-unit additivity can be

falsified. For a randomized experiment, the assumption of unit-treatment additivity implies that

the variance is constant for all treatments. Therefore, by contraposition, a necessary condition for

unit-treatment additivity is that the variance is constant.

The property of unit-treatment additivity is not invariant under a "change of scale", so

statisticians often use transformations to achieve unit-treatment additivity. If the response

variable is expected to follow a parametric family of probability distributions, then the

statistician may specify (in the protocol for the experiment or observational study) that the

responses be transformed to stabilize the variance.[5] Also, a statistician may specify that logarithmic transforms be applied to responses that are believed to follow a multiplicative model.
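A minimal sketch of the multiplicative case (hypothetical lognormal responses and a made-up treatment ratio): the raw variances differ across treatments, while the log-transformed variances match:

```python
# Sketch: a log transform stabilizes variance under a multiplicative model.
import numpy as np

rng = np.random.default_rng(7)
baseline = rng.lognormal(mean=0.0, sigma=0.3, size=50)
treated = baseline * 2.0                        # multiplicative effect

print(np.var(treated) / np.var(baseline))       # raw variances differ (4x)
print(np.var(np.log(treated)) / np.var(np.log(baseline)))  # after log: 1.0
```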

The assumption of unit-treatment additivity was enunciated in experimental design by

Kempthorne and Cox. Kempthorne's use of unit treatment additivity and randomization is similar

to the design-based inference that is standard in finite-population survey sampling.

DERIVED LINEAR MODEL:

Kempthorne uses the randomization-distribution and the assumption of unit treatment

additivity to produce a derived linear model, very similar to the textbook model discussed

previously.

The test statistics of this derived linear model are closely approximated by the test

statistics of an appropriate normal linear model, according to approximation theorems and

simulation studies by Kempthorne and his students (Hinkelmann and Kempthorne). However,

there are differences. For example, the randomization-based analysis results in a small but

(strictly) negative correlation between the observations (Hinkelmann and Kempthorne, volume

one, chapter 7; Bailey chapter 1.14). In the randomization-based analysis, there is no

assumption of a normal distribution and certainly no assumption of independence. On the

contrary, the observations are dependent!

The randomization-based analysis has the disadvantage that its exposition involves

tedious algebra and extensive time. Since the randomization-based analysis is complicated and is

closely approximated by the approach using a normal linear model, most teachers emphasize the

normal linear model approach. Few statisticians object to model-based analysis of balanced

randomized experiments.

Statistical models for observational data:

However, when applied to data from non-randomized experiments or observational

studies, model-based analysis lacks the warrant of randomization. For observational data, the

derivation of confidence intervals must use subjective models, as emphasized by Ronald A.

Fisher and his followers. In practice, the estimates of treatment effects from observational studies are often inconsistent (Freedman). In practice, "statistical models" and observational data are useful for suggesting hypotheses, which should be treated very cautiously by the public (Freedman).

Logic of ANOVA

Partitioning of the sum of squares

The fundamental technique is a partitioning of the total sum of squares (abbreviated SS) into components related to the effects used in the model. For example, for a simplified ANOVA with one type of treatment at different levels, the total sum of squares is split into components due to the treatments and due to error:

SS_Total = SS_Treatments + SS_Error.

The number of degrees of freedom (abbreviated df) can be partitioned in a similar way (df_Total = df_Treatments + df_Error), and each component specifies the chi-square distribution that describes the associated sum of squares.

See also Lack-of-fit sum of squares.
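The partition is easy to verify directly; a sketch with made-up one-way data:

```python
# Sketch verifying SS_Total = SS_Treatments + SS_Error (hypothetical data).
import numpy as np

rng = np.random.default_rng(8)
groups = [rng.normal(m, 1.0, 15) for m in (0.0, 0.5, 1.0)]
grand_mean = np.concatenate(groups).mean()

ss_total = sum(((g - grand_mean) ** 2).sum() for g in groups)
ss_treat = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
print(ss_total, ss_treat + ss_error)            # the two totals match
```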

The F-test

The F-test is used for comparisons of the components of the total deviation. For example, in one-way, or single-factor, ANOVA, statistical significance is tested for by comparing the F test statistic

F = (SS_Treatments / (I − 1)) / (SS_Error / (nT − I)),

where I is the number of treatments and nT is the total number of cases, to the F-distribution with I − 1 and nT − I degrees of freedom. Using the F-distribution is a natural candidate because the test statistic is the quotient of two mean sums of squares, each of which has a chi-square distribution.
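Continuing the sketch above (same hypothetical data), the F statistic assembled from the two mean squares matches scipy's built-in one-way ANOVA:

```python
# Sketch: build F from the sums of squares and compare with scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
groups = [rng.normal(m, 1.0, 15) for m in (0.0, 0.5, 1.0)]
grand_mean = np.concatenate(groups).mean()

I = len(groups)                                  # number of treatments
n_T = sum(len(g) for g in groups)                # total number of cases
ss_treat = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_stat = (ss_treat / (I - 1)) / (ss_error / (n_T - I))
p_value = stats.f.sf(f_stat, I - 1, n_T - I)     # upper tail of F(I-1, nT-I)
print(f_stat, p_value)
print(stats.f_oneway(*groups))                   # same F and p
```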

ANOVA on ranks:

When the data do not meet the assumptions of normality, the suggestion has arisen to

replace each original data value by its rank (from 1 for the smallest to N for the largest), then run

a standard ANOVA calculation on the rank-transformed data. Conover and Iman (1981)

provided a review of the four main types of rank transformations. Commercial statistical

software packages (e.g., SAS, 1985, 1987, 2008) followed with recommendations to data

analysts to run their data sets through a ranking procedure (e.g., PROC RANK) prior to

conducting standard analyses using parametric procedures.
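A sketch of the procedure (hypothetical skewed data; scipy's rankdata stands in for a ranking step such as PROC RANK):

```python
# Sketch: rank the pooled observations, then run a standard one-way ANOVA
# on the ranks (hypothetical skewed data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
groups = [rng.exponential(s, 20) for s in (1.0, 1.5, 2.0)]

ranks = stats.rankdata(np.concatenate(groups))   # 1 = smallest, N = largest
rank_groups = np.split(ranks, [20, 40])          # back into the three groups
print(stats.f_oneway(*rank_groups))              # parametric test on ranks
print(stats.kruskal(*groups))                    # closely related rank test
```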

This rank-based procedure has been recommended as being robust to non-normal errors,

resistant to outliers, and highly efficient for many distributions. It may result in a known statistic

(e.g., Wilcoxon Rank-Sum / Mann-Whitney U), and indeed provide the desired robustness and

increased statistical power that is sought. For example, Monte Carlo studies have shown that the

rank transformation in the two independent samples t test layout can be successfully extended to

the one-way independent samples ANOVA, as well as the two independent samples multivariate Hotelling's T² layouts (Nanna, 2002).

Conducting factorial ANOVA on the ranks of original scores has also been suggested

(Conover & Iman, 1976, Iman, 1974, and Iman & Conover, 1976). However, Monte Carlo

studies by Sawilowsky (1985a; 1989 et al.; 1990) and Blair, Sawilowsky, and Higgins (1987),

and subsequent asymptotic studies (e.g. Thompson & Ammann, 1989; "there exist values for the

main effects such that, under the null hypothesis of no interaction, the expected value of the rank

transform test statistic goes to infinity as the sample size increases," Thompson, 1991, p. 697),

found that the rank transformation is inappropriate for testing interaction effects in 4×3 and 2×2×2 factorial designs. As the number of non-null effects (i.e., main, interaction) increases, and as the magnitude of the non-null effects increases, there is an increase in Type I error, resulting in a

complete failure of the statistic with as high as a 100% probability of making a false positive

decision. Similarly, Blair and Higgins (1985) found that the rank transformation increasingly

fails in the two dependent samples layout as the correlation between pretest and posttest scores

increase. Headrick (1997) discovered the Type I error rate problem was exacerbated in the

context of Analysis of Covariance, particularly as the correlation between the covariate and the

dependent variable increased. For a review of the properties of the rank transformation in

designed experiments see Sawilowsky (2000).

A variant of the rank transformation is 'quantile normalization', in which a further

transformation is applied to the ranks such that the resulting values have some defined

distribution (often a normal distribution with a specified mean and variance). Further analyses of

quantile-normalized data may then assume that distribution to compute significance values.

However, two specific types of secondary transformations, the random normal scores and

expected normal scores transformation, have been shown to greatly inflate Type I errors and

severely reduce statistical power (Sawilowsky, 1985a, 1985b).
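A sketch of the mechanics of quantile normalization to normal scores (a van der Waerden-style mapping of ranks through the inverse normal CDF; the data are hypothetical, and, per the caveat above, such secondary transforms should be used with care):

```python
# Sketch: map ranks through the inverse normal CDF so the transformed
# values follow an approximately standard normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.exponential(1.0, size=50)                # skewed raw data

ranks = stats.rankdata(x)
normal_scores = stats.norm.ppf(ranks / (len(x) + 1))   # values now ~ N(0, 1)
print(stats.shapiro(normal_scores))              # close to normal
```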


EFFECT SIZE MEASURES:

Several standardized measures of effect are used within the context of ANOVA to

describe the degree of relationship between a predictor or set of predictors and the dependent

variable. Effect size estimates are reported to allow researchers to compare findings in studies

and across disciplines. Common effect size estimates reported in bivariate (e.g. ANOVA) and

multivariate (MANOVA, ANCOVA, Multiple Discriminant Analysis) statistical analyses include eta-squared, partial eta-squared, omega-squared, and intercorrelation (Strang, 2009).

η² (eta-squared): Eta-squared describes the ratio of variance explained in the dependent variable

by a predictor while controlling for other predictors. Eta-squared is a biased estimator of the

variance explained by the model in the population (it only estimates effect size in the sample).

On average it overestimates the variance explained in the population. As the sample size gets

larger the amount of bias gets smaller. It is, however, an easily calculated estimator of the

proportion of the variance in a population explained by the treatment. Note that earlier versions of statistical software (such as SPSS) incorrectly report partial eta-squared under the misleading title "Eta squared".

Partial η² (partial eta-squared): Partial eta-squared describes the "proportion of total variation

attributable to the factor, partialling out (excluding) other factors from the total nonerror

variation" (Pierce, Block & Aguinis, 2004, p. 918). Partial eta squared is normally higher than

eta squared (except in simple one-factor models).
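A minimal arithmetic sketch (all sums of squares are made-up numbers) of the two ratios and why the partial version is usually larger:

```python
# Sketch: eta-squared vs. partial eta-squared from hypothetical sums of squares.
ss_factor = 40.0    # SS attributable to the factor of interest
ss_other = 25.0     # SS attributable to the other factors
ss_error = 100.0    # residual (error) SS
ss_total = ss_factor + ss_other + ss_error

eta_sq = ss_factor / ss_total                         # share of total variation
partial_eta_sq = ss_factor / (ss_factor + ss_error)   # other factors excluded
print(eta_sq, partial_eta_sq)                         # partial is larger here
```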

Several variations of benchmarks exist.

The generally accepted regression benchmark for effect size comes from Cohen (1992; 1988): 0.20 is a minimal effect (but significant in social science research); 0.50 is a medium effect; anything equal to or greater than 0.80 is a large effect size (Keppel & Wickens, 2004; Cohen, 1992).

Because this common interpretation of effect size has been repeated from Cohen (1988)

over the years with no change or comment to validity for contemporary experimental research, it

is questionable outside of psychological/behavioural studies, and more so questionable even then

without a full understanding of the limitations ascribed by Cohen. Note: the use of specific partial eta-squared values for large, medium, or small effects as a "rule of thumb" should be avoided.

Nevertheless, alternative rules of thumb have emerged in certain disciplines: Small =

0.01; medium = 0.06; large = 0.14 (Kittler, Menard & Phillips, 2007).


ω² (omega-squared): Omega squared provides a relatively unbiased estimate of the variance explained in the population by a predictor variable. It takes random error into account more so than eta squared, which is biased upward, often substantially. The calculations for omega squared differ depending on the experimental design. For a fixed experimental design (in which the categories are explicitly set), omega squared is calculated as follows:

ω² = (SS_Treatments − (I − 1) · MS_Error) / (SS_Total + MS_Error),

where MS_Error = SS_Error / (nT − I), with I treatments and nT total cases as before.

Cohen's ƒ: This measure of effect size is frequently encountered when performing power analysis calculations. Conceptually, it represents the square root of variance explained over variance not explained:

ƒ = √(η² / (1 − η²)).
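A sketch computing both measures from the same hypothetical one-way quantities (all numbers made up):

```python
# Sketch: omega-squared (fixed design) and Cohen's f from hypothetical values.
ss_treat, ss_error = 40.0, 100.0
I, n_T = 3, 60                                   # treatments, total cases
ss_total = ss_treat + ss_error
ms_error = ss_error / (n_T - I)

omega_sq = (ss_treat - (I - 1) * ms_error) / (ss_total + ms_error)
eta_sq = ss_treat / ss_total
cohens_f = (eta_sq / (1 - eta_sq)) ** 0.5        # sqrt(explained / unexplained)
print(omega_sq, cohens_f)
```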

Follow-up tests

A statistically significant effect in ANOVA is often followed up with one or more

different follow-up tests. This can be done in order to assess which groups are different from

which other groups or to test various other focused hypotheses. Follow-up tests are often

distinguished in terms of whether they are planned (a priori) or post hoc. Planned tests are

determined before looking at the data and post hoc tests are performed after looking at the data.

Post hoc tests such as Tukey's range test most commonly compare every group mean with every

other group mean and typically incorporate some method of controlling for Type I errors.

Comparisons, which are most commonly planned, can be either simple or compound. Simple

comparisons compare one group mean with one other group mean. Compound comparisons typically compare two sets of group means where one set has two or more groups (e.g.,

compare average group means of group A, B and C with group D). Comparisons can also look at

tests of trend, such as linear and quadratic relationships, when the independent variable involves

ordered levels.
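A sketch of a Tukey range test with statsmodels (hypothetical groups and labels); every group mean is compared with every other while controlling the familywise Type I error:

```python
# Sketch of Tukey's HSD post hoc test (hypothetical data).
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(11)
values = np.concatenate([rng.normal(m, 1.0, 20) for m in (0.0, 0.5, 1.2)])
labels = np.repeat(["A", "B", "C"], 20)

print(pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05))
```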

Power analysis

Power analysis is often applied in the context of ANOVA in order to assess the

probability of successfully rejecting the null hypothesis if we assume a certain ANOVA design,

effect size in the population, sample size and alpha level. Power analysis can assist in study

design by determining what sample size would be required in order to have a reasonable chance

of rejecting the null hypothesis.
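A sketch of such an a priori calculation with statsmodels (the assumed effect size, alpha, power, and group count are all hypothetical inputs):

```python
# Sketch: solve for the total sample size needed for a one-way ANOVA.
from statsmodels.stats.power import FTestAnovaPower

n_total = FTestAnovaPower().solve_power(
    effect_size=0.25,   # assumed Cohen's f ("medium" by one rule of thumb)
    alpha=0.05,
    power=0.80,
    k_groups=3,
)
print(n_total)          # total observations needed across all groups
```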

Examples

In a first experiment, Group A is given vodka, Group B is given gin, and Group C is

given a placebo. All groups are then tested with a memory task. A one-way ANOVA can be used

to assess the effect of the various treatments (that is, the vodka, gin, and placebo).

In a second experiment, Group A is given vodka and tested on a memory task. The same group is

allowed a rest period of five days and then the experiment is repeated with gin. The procedure is


repeated using a placebo. A one-way ANOVA with repeated measures can be used to assess the

effect of the vodka versus that of the placebo.

In a third experiment testing the effects of expectations, subjects are randomly assigned to four

groups:

expect vodka—receive vodka

expect vodka—receive placebo

expect placebo—receive vodka

expect placebo—receive placebo (the last group is used as the control group)

Each group is then tested on a memory task. The advantage of this design is that multiple

variables can be tested at the same time instead of running two different experiments. Also, the

experiment can determine whether one variable affects the other variable (known as interaction

effects). A factorial ANOVA (2×2) can be used to assess the effect of expecting vodka or the placebo and the effect of actually receiving either.

History

The analysis of variance was used informally by researchers in the 1800s using least

squares. In physics and psychology, researchers included a term for the operator-effect, the

influence of a particular person on measurements, according to Stephen Stigler's histories.

In its modern form, the analysis of variance was one of the many important

statistical innovations of Ronald A. Fisher. Fisher proposed a formal analysis of variance in his

1918 paper The Correlation Between Relatives on the Supposition of Mendelian Inheritance. His

first application of the analysis of variance was published in 1921. Analysis of variance became

widely known after being included in Fisher's 1925 book Statistical Methods for Research

Workers.