
Design of Experiments and Analysis of Variance

(4MMC, 4BI)

November 2005

dr. ir. G. Jongbloed

Preface

This course for students of informatiekunde consists of four lectures. One lecture gives a motivation for and description of empirical research, making the connection with statistics. In the academic year 2005-2006 this lecture will be the third of the four.

The other three lectures correspond to the material in these notes. Each chapter is supported by practical exercises that are to be done using the statistical package R. These will be made available via the course web page. The subjects covered in these notes are

1) The 2- and k-sample problem

2) Basic experimental design and analysis of (co)variance

One could say that particular stochastic models are studied that can often be used to answer interesting questions of an empirical researcher. These models are not studied in depth from a theoretical point of view; the available time and background of the students the course is aimed at do not allow for that. The aim is, however, that students will understand the proposed models, be able to apply them in practice and interpret the results of the analyses correctly.

The statistical package R can and will be used to set up and analyse experiments as discussed in these notes. It will also be used for stochastic simulation and as such will hopefully help the students to get some feeling for the models at hand.

Any suggestions to improve the presentation of the material in these notes are welcome!

Amsterdam, November 1, 2005.


Chapter 1

The k-sample problem

1.1 Introduction

In many research areas, it is important to assess differences between certain groups. If some type of medicine is to be replaced by a new medicine, it has to be shown that the new medicine is better than the old one (or that its negative side effects are less severe). If one type of tire is more expensive than another type, it would help the marketing strategy if research has shown that the more expensive type of tire also ‘survives’ longer than the cheaper tire under similar circumstances. For the management of an international web shop it is important to know whether customers from different countries have different associations with certain colors, affecting their buying behavior. If such a difference is present, its nature can be investigated and the shop can adapt the page colors depending on the country the potential customer is from.

In this chapter we consider the often encountered k-sample problem, where k populations are to be compared based on a sample from each. The two-sample problem is a specific instance of the k-sample problem and is the subject of section 1.2. We will derive a test statistic for this situation and describe the procedure to be followed to assess ‘equality of the two populations’ using this two-sample t test statistic.

Section 1.3 deals with the k-sample problem. There a generalization of the two-sample t-statistic is derived which can be interpreted as a ratio of two variances: the variance between the group means and the variance of the individual elements in the sample relative to their own group mean. If the null hypothesis of equality holds, this test statistic should have values close to one. If the value is much bigger than one, this is ‘statistical evidence’ against the null hypothesis and the null hypothesis should be rejected.

Stochastic simulation is an important tool in modern statistics. In particular, it can be used to develop some more intuition for the material that is presented in this chapter. Section 1.4 presents a number of elementary R functions that can be used to this end.

1.2 Two sample t-test

A classical setting in statistics is the situation where two comparable samples of data are given and one is interested in a possible difference in properties of the populations these samples are from.

Example The following data are measurements of the heat-producing capacity (in millions of calories per ton) of specimens of coal from two mines.

Mine 1  8260 8130 8350 8070 8340
Mine 2  7950 7890 7900 8140 7920 7840

The mining company is interested in whether or not the expected heat-producing capacities of specimens of coal from the two mines are equal. The following lines of code in R,

> x<-c(8260,8130,8350,8070,8340)

> y<-c(7950,7890,7900,8140,7920,7840)

> boxplot(x,y,col="cyan")

produce the boxplots below. These visualize the data for the two mines.

[Figure: boxplots of the measurements for mines 1 and 2.]


Remember that a boxplot has a box, extending from the first quartile to the third quartile (and thus containing half of all the data points). The horizontal line in the box denotes the second quartile (or median). The whiskers of the boxplot indicate the maximal value in the data set in the region extending from the third quartile to the third quartile plus 1.5 times the interquartile range (the difference between third and first quartile; the height of the box), and the minimal value in the data set in the region between the first quartile minus 1.5 times the interquartile range and this first quartile. Individual data points situated outside this region are indicated by a point.
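The quantities behind such a boxplot can be computed directly in R; a minimal sketch for the mine 1 data (note that quantile implements several quartile conventions, so the box edges may differ very slightly from those drawn by boxplot):

> quantile(x,c(0.25,0.5,0.75))  # first quartile, median, third quartile
> IQR(x)                        # interquartile range (height of the box)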

A stochastic model for this situation is to assume that the two samples X1, X2, . . . , Xn and Y1, Y2, . . . , Ym are independent. The question then becomes whether or not the two samples come from the same distribution. Under the assumption that both samples come from a normal distribution with (common) variance parameter σ² and (possibly differing) expectations µX and µY, this question can be translated to a formal null hypothesis H0 : µX = µY.

The normal distribution plays an incredibly important role in (applied) statistics. In R one can simulate a sample from a normal distribution using the function rnorm. Use the function shownormhist given in section 1.4 to get some feeling for normally distributed data and the influence of the expectation parameter µ and variance parameter σ² on these data. See also chapter 5 in Bennett, Briggs and Triola.

In order to test this hypothesis versus the alternative hypothesis H1 : µX ≠ µY, a test statistic is used. This statistic (a function of the available data) is based on the fact that the sample mean of the Xi’s is a sensible estimator of µX and the sample mean of the Yi’s of µY. Hence, the difference of the two sample means is an estimate of the difference between µX and µY:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \approx \mu_X \quad\text{and}\quad \bar{Y}_m = \frac{1}{m}\sum_{i=1}^{m} Y_i \approx \mu_Y \;\Rightarrow\; \bar{X}_n - \bar{Y}_m \approx \mu_X - \mu_Y$$

That this approximation holds follows from the law of large numbers (also known as the law of averages, see Bennett, Briggs & Triola, chapter 6). Simulation can also be used to see this. The function checkdiff described in section 1.4 helps to get some feeling for this.


If this difference of sample means is too big (either way, positive or negative), this is an indication that the null hypothesis is not true. In order to determine exactly what ‘too big’ means in this situation, something about the accuracy of the difference of the sample means as an estimate of the difference of expectations (population means) has to be known. This accuracy can be measured in terms of the variance of the estimator, which is in turn related to the (common) variance σ² of the original variables.

It can be shown (see also Bennett, Briggs & Triola, p.212) that

$$\mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n} \quad\text{and}\quad \mathrm{Var}(\bar{Y}_m) = \frac{\sigma^2}{m}.$$

From this it follows (since the samples are independent) that

$$\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \frac{\sigma^2}{n} + \frac{\sigma^2}{m} = \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right)$$

This expression shows that this variance decreases in n and m and increases in σ². For bigger samples (n and/or m bigger), X̄n − Ȳm estimates µX − µY more accurately than for small samples, and if the original measurements Xi and Yi are more accurate estimates of µX and µY (σ² smaller), X̄n − Ȳm will estimate µX − µY more accurately too.
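This variance formula can also be checked by simulation, in the spirit of section 1.4. A small sketch for the sample sizes of the mining example (n = 5, m = 6) with σ = 1, so that the theoretical value is 1/5 + 1/6 ≈ 0.367; the name d is ours:

> # empirical variance of many simulated differences of sample means
> d<-replicate(10000,mean(rnorm(5))-mean(rnorm(6)))
> var(d)   # should be close to 1/5+1/6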

The next step now is to standardize X̄n − Ȳm by the square root of its variance and consider

$$\frac{\bar{X}_n - \bar{Y}_m}{\sigma\sqrt{1/n + 1/m}} \qquad (1.1)$$

If the null hypothesis is true, this quantity has expectation zero and variance 1. In fact, this quantity has a standard normal distribution (see chapter 5 in Bennett, Briggs & Triola). Hence, compared to the unstandardized X̄n − Ȳm, we now know something about the expected range of the standardized test statistic (provided, of course, that the null hypothesis that µX = µY holds true). But the standardized statistic is not a statistic in the sense that it only depends on the available data: it also depends on the unknown variance parameter σ². By estimating this parameter based on the available data, and plugging this estimate into the standardized statistic, we obtain the well-known Student t-statistic for this testing problem. More specifically, we define the pooled estimate of variance, S², by

$$S^2 = \frac{(n-1)S^2_{X,n} + (m-1)S^2_{Y,m}}{n+m-2} \qquad (1.2)$$


where

$$S^2_{X,n} = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2 \quad\text{and}\quad S^2_{Y,m} = \frac{1}{m-1}\sum_{i=1}^{m}\left(Y_i - \bar{Y}_m\right)^2$$

are the standard variance estimates based on the samples of Xi’s and Yi’s respectively (see Bennett, Briggs & Triola, p. 176). The Student t-statistic is then obtained by substituting S = √S² as defined in (1.2) for σ in (1.1):

$$T = \frac{\bar{X}_n - \bar{Y}_m}{S\sqrt{1/n + 1/m}} \qquad (1.3)$$

Now note that this test statistic can be expected to be close to zero if the null hypothesis is true. Hence, H0 is rejected in favor of H1 if the observed value of T is too small (negative) or too big (positive). A basic result from probability theory is that the test statistic T defined in (1.3) follows a Student t-distribution with n + m − 2 degrees of freedom, regardless of the precise value of σ², if the null hypothesis is true. The origin of this ‘degrees of freedom’ terminology will be explained in chapter 2. Figure 1.1 shows the probability density of a Student t-distribution with 9 degrees of freedom. Now the decision process is to reject the null hypothesis at level α if the observed value of T is smaller than the α/2 percentage point (or ‘quantile’) of the t-distribution with n + m − 2 degrees of freedom or if it is bigger than the 1 − α/2 percentage point of that distribution.

Figure 1.1: Probability density function of the t9 distribution.

In Figure 1.1 the rejection region consists of the x-coordinates corresponding to the shaded region (for α = 0.05). The area under the probability density is by definition one, and the area of the shaded region is α. Note that the critical values (boundary points of the critical region) move into the tails of the density (to minus and plus infinity respectively) if α is taken smaller, approaching zero. Hence, one can say in general that if a null hypothesis is rejected at a certain level α1, it will also be rejected at all other levels α with α > α1. If α is chosen only slightly smaller than one, it is obvious that the critical values will approach zero from both sides. Hence, for a level α close enough to one, a given null hypothesis will be rejected.

If H0 is true, it is still possible that T takes its value in the critical region. Depending on the choice made for the level α, it is, however, not probable that T falls in the critical region. Indeed, if H0 is true, this probability is exactly equal to α.

A more informative way of summarizing the conclusion of the test is to give its associated p-value. This is the smallest value of α for which the hypothesis would be rejected using the data at hand. The p-value is more informative since one can always answer the question whether or not the hypothesis should be rejected at level α if the p-value is known. Indeed, if the p-value is 0.0177, then H0 will be rejected at level α = 0.05, but not at α = 0.01. Conversely, given only the conclusion that the null hypothesis is rejected at the α = 0.05 level, one cannot recover the p-value of the test or answer the question whether or not the hypothesis is rejected at the α = 0.01 level.
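Expressed in R, the decision based on a known p-value is a simple comparison; a small sketch using the hypothetical p-value from the text:

> p<-0.0177   # hypothetical p-value
> p<0.05      # TRUE: reject at level 0.05
> p<0.01      # FALSE: do not reject at level 0.01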

Example (revisited) Let us go through the general setup for the mining example. Note that n = 5 and m = 6 in this case, and the realized values of X̄5 and Ȳ6 are obtained in R via

> mean(x)

[1] 8230

> mean(y)

[1] 7940

Hence, we observe a difference in means x̄5 − ȳ6 = 290. From this value alone, we cannot conclude whether it is big enough to reject the null hypothesis; we need to compute test statistic (1.3). To that end we first compute both estimates of the variance and the pooled estimate:

> s2X<-var(x)


> s2Y<-var(y)

> s2<-(4*s2X+5*s2Y)/(5+6-2)

> s2

[1] 13066.67

This means that the observed value of test statistic (1.3) can be computed as

> t<-(mean(x)-mean(y))/(sqrt(s2)*sqrt(1/5+1/6))

> t

[1] 4.189671

For α = 0.05, the critical values of the t-distribution with n + m − 2 = 5 + 6 − 2 = 9 degrees of freedom are

> qt(0.975,9)

[1] 2.262157

> qt(0.025,9)

[1] -2.262157

Following the decision rule, we reject the null hypothesis that µX = µY at level α = 0.05 whenever the test statistic is smaller than −2.262157 or bigger than 2.262157. In view of the observed value, we decide to reject the null hypothesis at level α = 0.05. See also Figure 1.1 for the critical region in this example.

Now we compute the p-value associated with the observed data. If the 1 − α/2 percentage point of the t9-distribution is smaller than the observed value of 4.189671, the null hypothesis will be rejected; if this percentage point is bigger than 4.189671, the null hypothesis will not be rejected. Hence, the p-value corresponds to twice the area under the probability density of the t9 distribution to the right of 4.189671. This quantity can be computed with R via

> 2*(1-pt(t,9))

[1] 0.002342261

Hence, only at levels α smaller than this number would the null hypothesis not be rejected.

In R, many distributions are known, e.g. the normal and t distributions. For all these distributions, one can compute the density function, probability distribution function and quantile function using the prefixes d, p and q respectively (e.g. dnorm, pnorm and qnorm). A sample from the distribution of interest can be generated using the prefix r, e.g. rnorm.


It is good to see how the steps in the analysis can be made one at a time. In R one can also perform the two-sample t-test with one command:

> t.test(x,y,var.equal=TRUE)

Two Sample t-test

data: x and y

t = 4.1897, df = 9, p-value = 0.002342

alternative hypothesis: true difference in means is not

equal to 0

95 percent confidence interval:

133.4183 446.5817

sample estimates:

mean of x mean of y

8230 7940

In order to get some feeling for the t-test, the function tsim in section 1.4 can be used. One call of that function performs a two-sample t-test based on two simulated data sets from normal distributions with common variance σ².

Remark. The output of the two-sample t-test procedure in R also contains the two end-points of a 95% confidence interval for the difference in expectation for the two populations, in the example above 133.4183 and 446.5817. Such a confidence set is also called an interval estimate for the difference µX − µY, as opposed to the point estimate x̄n − ȳm for this quantity. The interpretation of the 95% is that this interval estimate is constructed according to a procedure that would result in an interval containing the true µX − µY in approximately 95% of a large number of replications of the whole experiment.
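This interval can be reproduced from the quantities computed earlier; a sketch using the standard construction (difference of the sample means plus and minus the 0.975 quantile of the t9-distribution times the estimated standard error; the name se is ours):

> se<-sqrt(s2)*sqrt(1/5+1/6)
> (mean(x)-mean(y))+c(-1,1)*qt(0.975,9)*se
[1] 133.4183 446.5817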

1.3 The F -test in the k-sample problem

In many practical situations, samples from more than two populations are given and the question is whether or not the associated population means are equal.

Example Consider three methods to teach programming in a certain computer language. Method A is straight computer-based instruction, without the help of an instructor. Method B involves the personal attention of an instructor and some direct experience working with the computer. Method C involves the personal attention of an instructor but no work with the computer. Random samples of size 4 were taken from large groups of persons taught by the various methods. The following are the scores they obtained in an appropriate achievement test.

Method A  73 77 67 71
Method B  91 81 87 85
Method C  72 77 76 79

It is of interest to assess whether there is a difference in effectiveness between the three methods of teaching. Via the lines

> ma<-c(73,77,67,71)

> mb<-c(91,81,87,85)

> mc<-c(72,77,76,79)

the data are loaded into R. Figure 1.2 visualizes the data. Note that in this example n1 = n2 = n3 = 4 and n = 12.

Figure 1.2: Visualization of teaching method data (panels (A), (B) and (C)).

Using the two sample t-test, one could test pairwise hypotheses, e.g. whether or not the expected scores for students taught using method A and method B are the same. This test can, however, not be used to test for equality of the three expectations. In this section, we propose a stochastic model for the situation with k populations, state the relevant null hypothesis and derive an informative test statistic.

Consider k samples, all from normal populations with (unknown, common) variance parameter σ². We are interested in whether or not the expectations of these k populations, denoted by µ1, . . . , µk, are equal. More formally, we have for 1 ≤ i ≤ k a sample

$$Y_{i1}, Y_{i2}, \ldots, Y_{in_i}$$

from a normal population with mean µi and variance σ². We wish to test the null hypothesis

H0 : µ1 = µ2 = · · · = µk

against the alternative that this equality does not hold.

The alternative hypothesis does not state that ‘all µi’s are different’; it states that they are not all equal.

Note that the sample sizes do not have to be equal across samples; the sample size in the i-th group is denoted by ni. The total number of observations is denoted by $n = \sum_{i=1}^{k} n_i$.

This testing problem is a generalized version of the two-sample problem considered in the preceding section (k = 2). In this section we derive a test statistic for this type of hypothesis and see how it boils down to analyzing appropriate variances, a technique called Analysis of Variance (ANOVA). In chapter 2 this type of analysis will be used in an experimental setting where the samples correspond to individuals having received different treatments.

To start with, consider the Student t-statistic for the two-sample problem as defined in (1.3), in the situation where the two samples have equal size; write n1 = n2 = n/2, where n now denotes the total number of observations. It is not clear how this test statistic can be generalized to a test statistic that can be used for the k-sample problem. Instead of using T in the two-sample problem, we could just as well use the statistic T², rejecting the null hypothesis for large values of T². Write

$$T^2 = \frac{\left(\bar{X}_{n_1} - \bar{Y}_{n_1}\right)^2}{S^2\,(2/n_1)}$$

and note that

$$\left(\bar{X}_{n_1} - \bar{Y}_{n_1}\right)^2 = 2\left(\bar{X}_{n_1} - \frac{\bar{X}_{n_1} + \bar{Y}_{n_1}}{2}\right)^2 + 2\left(\bar{Y}_{n_1} - \frac{\bar{X}_{n_1} + \bar{Y}_{n_1}}{2}\right)^2.$$


This is just twice the sample variance of the two quantities X̄n1 and Ȳn1. Using this expression and the fact that n1 = n2 = n/2, we get

$$T^2 = \frac{n_1\left[\left(\bar{X}_{n_1} - \frac{\bar{X}_{n_1}+\bar{Y}_{n_1}}{2}\right)^2 + \left(\bar{Y}_{n_1} - \frac{\bar{X}_{n_1}+\bar{Y}_{n_1}}{2}\right)^2\right]}{S^2} = \frac{\frac{n}{2}\left[\left(\bar{X}_{n_1} - \frac{\bar{X}_{n_1}+\bar{Y}_{n_2}}{2}\right)^2 + \left(\bar{Y}_{n_2} - \frac{\bar{X}_{n_1}+\bar{Y}_{n_2}}{2}\right)^2\right]}{\frac{1}{n-2}\left[\sum_{i=1}^{n_1}\left(X_i - \bar{X}_{n_1}\right)^2 + \sum_{i=1}^{n_2}\left(Y_i - \bar{Y}_{n_2}\right)^2\right]}$$

and see that T² is the ratio of two estimates of variance: the variance between the group means and the variance estimated within the groups. Observing a large value of T² therefore means that the variance between the various group means is big compared to the variance that is observed within the two samples. It is a basic result from probability theory that T² has a so-called Fisher distribution with 1 and n − 2 degrees of freedom if the null hypothesis is true:

$$T^2 \sim F_{1,n-2} \quad\text{if } \mu_X = \mu_Y.$$
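For the mining data this relation can be checked numerically, reusing the objects from section 1.2; a small sketch (output not shown):

> t^2                            # squared two-sample t-statistic
> 2*(1-pt(t,9)); 1-pf(t^2,1,9)   # these two p-values coincide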

Figure 1.3: Probability density function of various Fp,q distributions, for (p, q) = (2, 9), (5, 12) and (8, 20).

Now return to the k-sample problem. The interpretation of T² as a ratio of two estimates of variance (between groups and within groups) suggests a straightforward generalization for the k-sample problem:

$$F = \frac{\sum_{i=1}^{k} n_i\left(\bar{Y}_i - \bar{Y}\right)^2/(k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_i\right)^2/(n-k)}$$


Here Ȳi denotes the mean of the observed values in the i-th group:

$$\bar{Y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$$

Another notation for this quantity (to be used extensively in chapter 2) is Ȳi·, the dot being shorthand for ‘mean taken over this index’. Using this convention we can also write

$$F = \frac{\sum_{i=1}^{k} n_i\left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2/(k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2/(n-k)}$$

This F-test statistic can be shown to have a Fisher distribution with k − 1 and n − k degrees of freedom if the null hypothesis holds true. Hence, the null hypothesis is rejected at level α if the observed value of F is bigger than the critical value of the F-distribution with parameters k − 1 and n − k at level α (the 1 − α quantile of this distribution). Stated differently, the null hypothesis is rejected if the p-value associated with the observed value of F is smaller than the level α.

Example (revisited) To compute the F-statistic in the teaching method example, we need to compute a couple of sums of squares. The sum of squared deviations of the group means from the overall mean can be computed as

> mvec<-c(mean(ma),mean(mb),mean(mc)); mvec

[1] 72 86 76

> ss1<-4*sum((mvec-mean(mvec))^2); ss1

[1] 416

The sum of squares in the denominator of the F statistic can be computed by summing the three sums of squared deviations of the measurements within the three groups. In R this boils down to

> ss2<-sum((ma-mvec[1])^2)+sum((mb-mvec[2])^2)+
+ sum((mc-mvec[3])^2); ss2
[1] 130

This means that the F-statistic (with k − 1 = 2 and n − k = 9 degrees of freedom) is given by

> f<-9*ss1/(2*ss2); f

[1] 14.4

Hence, using that the null hypothesis is to be rejected for large values of F, the p-value can be computed as


> 1-pf(f,2,9)

[1] 0.001568116

This result means that if the null hypothesis were true, the realized value of the F statistic would be exceptionally large: it would only occur about 15 times if the experiment were repeated ten thousand times. If the level of the test is chosen to be α = 0.01, this realization leads to rejection of the null hypothesis. Note that Figure 1.3 shows the density of the F2,9 distribution as a solid line. The 0.01 critical value of this distribution (the point to the right of which there is exactly an area of 0.01 below the density curve) is given by

> qf(0.99,2,9)

[1] 8.021517

Just as in the two-sample situation, there is a more efficient way of computing the F-statistic and its associated p-value. The function aov in R can be used to that end. We will encounter that function again in chapter 2.

> mdat<-data.frame(meas=c(ma,mb,mc),meth=rep(c("A","B","C"),
+ each=4))
> meth.aov<-aov(meas~meth,data=mdat)
> summary(meth.aov)

Df Sum Sq Mean Sq F value Pr(>F)

meth 2 416.00 208.00 14.4 0.001568 **

Residuals 9 130.00 14.44

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This so-called ANOVA table shows two sums of squares, two mean squares (the associated sum of squares divided by its corresponding number of degrees of freedom), the F-statistic (the ratio of the two mean squares) and its associated p-value. For the aov function it is necessary to represent the data in a so-called data frame. This is a matrix in which different columns may be of different data types (e.g. character and numeric). The model formula in the aov function (meas~meth) is needed to identify the different groups by the factor meth.

Just as in the case of the two sample t-test, one can use simulation to get some feeling for the k-sample test. In section 1.4, the function Fsim is given that can be used for that.


1.4 Some R functions to play with

In this section the R code of some functions referred to in this chapter can be found. These are also available electronically and can be copied into the R commands window.

Function shownormhist

This function simulates a sample of size n from the normal distribution with expectation mu and standard deviation sd. It draws a histogram of the data and adds the corresponding probability density function to this plot. It is instructive to simulate with this function, using various values of n, mu and sd.

> shownormhist<-function(n,mu=0,sd=1){
+ x<-rnorm(n,mu,sd)
+ hist(x,prob=T)
+ as<-seq(min(x),max(x),length=200)
+ lines(as,dnorm(as,mu,sd))}
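A possible call, e.g. to look at 100 simulated values from a normal distribution with µ = 5 and σ = 2:

> shownormhist(100,mu=5,sd=2)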

Function checkdiff

This function shows the difference between two sample means of simulated normally distributed data. For large values of n and m, this difference in means should be close to the difference in the expected values of X and Y, i.e. µX − µY.

> checkdiff<-function(n=100,m=100,muX=1,muY=0,sd=1){
+ mean(rnorm(n,muX,sd))-mean(rnorm(m,muY,sd))}

Function tsim

The function tsim performs the t-test based on two simulated samples. If the null hypothesis is true (µX = µY), then the proportion of simulations where the p-value is below 0.05 is about 0.05, reflecting the fact that under the null hypothesis the probability of rejecting this hypothesis equals the level of significance. If the null hypothesis is not true (so µX ≠ µY), the proportion of simulations where the p-value is below 0.05 is larger than 0.05. The bigger the actual difference between µX and µY, the more frequently the null hypothesis will be rejected.

> tsim<-function(n=10,m=10,muX=0,muY=0,sd=1){
+ x<-rnorm(n,muX,sd)
+ y<-rnorm(m,muY,sd)
+ t.test(x,y,var.equal=TRUE)}
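Since tsim returns the t.test object, the rejection proportion over many simulations can be estimated directly; a usage sketch (under H0, so the result should be close to 0.05):

> mean(replicate(1000,tsim()$p.value)<0.05)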

Function Fsim

The function Fsim performs the F-test based on k simulated samples. If the null hypothesis is true (all expectations equal), then the proportion of simulations where the p-value is below 0.05 is about 0.05, reflecting the fact that under the null hypothesis the probability of rejecting this hypothesis equals the level of significance. If the null hypothesis is not true (so not all expectations are equal), the expected proportion of simulations where the p-value is below 0.05 is larger than 0.05.

> Fsim<-function(nvec=c(10,10,10),muvec=c(-1,0,1),sd=1){
+ k<-length(nvec)
+ gr<-rep(1,times=nvec[1])
+ val<-rnorm(nvec[1],muvec[1],sd)
+ for (i in (2:k)){
+ gr<-c(gr,rep(i,times=nvec[i]))
+ val<-c(val,rnorm(nvec[i],muvec[i],sd))}
+ gr<-as.factor(gr)
+ dat<-data.frame(val,gr)
+ u<-summary(aov(val~gr,data=dat))
+ u}
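The object returned by Fsim is the ANOVA summary, from which the p-value can be extracted; a usage sketch that relies on the list structure of summary.aov objects:

> # p-value of one simulated F-test with all expectations equal
> Fsim(muvec=c(0,0,0))[[1]][["Pr(>F)"]][1]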


Chapter 2

Design of Experiments and AN(C)OVA

2.1 Introduction

There is a big difference between observational studies and experimental studies, especially with respect to the possible conclusions that can be drawn. The example of the teaching methods described in section 1.3 is an observational study as it stands. From the three groups of people, random samples of size four were taken and these people were tested on some measure of achievement. The conclusion of this study could be that there is a difference in expected level of achievement between the three groups. It is not possible to conclude that this difference is due to the difference in teaching method. In observational studies, the researcher only inspects aspects of certain individuals without interfering in the circumstances of these individuals.

In many experimental studies, a more or less homogeneous collection of experimental units is divided into groups and these groups each get a different treatment. By ensuring that the various groups are as comparable as possible before the treatments are administered, one may conclude that if differences turn out to be present between the various groups after the study, these differences are caused by the different treatments.

The branch within the field of statistics that addresses the question of how to set up decent experiments, in order to be able to draw conclusions from the results that are as strong as possible, is Design of Experiments. In section 2.2 the simplest of all designs is introduced: the completely randomized design. The data that are obtained from an experiment in that set-up can, under appropriate assumptions, be analyzed using the F-test described in section 1.3. The result of such an Analysis of Variance (ANOVA) can be summarized in a so-called ANOVA table of the type encountered in the teaching method example of section 1.3.

The randomized block design is another experimental design, where (known) inhomogeneity in the individuals before the start of the study is taken into account by carefully selecting the units that are assigned the various treatments, using ‘blocking’. This design too has a corresponding ANOVA that can be reported in an ANOVA table. It will be studied in section 2.3.

Sometimes, a treatment is a combination of various other treatments. E.g., a patient can be given aspirin or not and, in the same experiment, be given a certain type of insulin or not. In principle, there are then four different treatments: the patient gets both substances, only aspirin, only insulin, or neither. However, there is some structure in these treatments. Such treatment structure is encountered very often and is called a factorial structure. Section 2.4 describes the general form of such a study and its associated ANOVA table.

Using blocking of experimental units, heterogeneity in these units can be taken into account. Sometimes it is not possible to use the blocking strategy. Then one has to correct for heterogeneity in the experimental units after having completed the experiment. The method of analysis is then called Analysis of Covariance (ANCOVA). This technique will be briefly described in section 2.5.

2.2 Completely randomized design

Consider a factory where chips for the computer industry are manufactured. There are some criteria a chip has to satisfy in order to be acceptable for the computer industry. One aspect of interest is the weight of the metal coating of a certain part of the chip. At the factories where the chips are used, samples of chips are checked on this coating and some controversy has arisen concerning the quality of the chips. Now the manufacturer and the computer assemblers agree to perform an experiment on the quality of the laboratories. To that end, a sample from the day-production of chips is taken


and the idea is to distribute these chips among the four laboratories (one of the chip manufacturer and three of various computer assembling companies). The measurements are destructive, meaning that the weight of the coating of one chip cannot be measured at more than one laboratory.

Suppose each laboratory is to be assigned 12 chips from the day production. A total of 48 chips is selected. Now the question arises which chip to assign to which laboratory.

In order to prevent any systematic differences in the chips that are sent to the various laboratories, the assignment is to be done at random. The simplest way to do this is as follows. First number the chips as they are given, say 1 to 48. Then type the following in R.

> chip<-1:48

> lab<-rep(c("A","B","C","D"),each=12)

> assignment<-cbind(sample(chip),lab)

Then assignment is a two-column matrix with the number of the chip in the first column and the laboratory to send it to in the second.

This concept of randomization is essential for proper experimental design. If, for example, the chips were manually assigned by a person, this person could consciously or unconsciously select the better chips for one laboratory.

Using randomization, all laboratories are treated fairly. All have the same probability of getting the inherently ‘best’ or ‘worst’ chips.
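A quick sanity check on the resulting design (the second column of assignment holds the laboratory labels):

> table(assignment[,2])   # each laboratory should occur exactly 12 times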

Now, suppose the measurements from the various laboratories are given in the file chipdat.txt in the R working directory, of which the first couple of lines read as

Lab weight

A 0.25

A 0.27

A 0.22

Then the line

> labdat<-read.table("chipdat.txt",header=TRUE)

creates a data frame containing the data in R. In order to test the null hypothesis that the laboratories do not differ in their measurement of the coating weights, we can use the aov function also used in section 1.3. First, we inspect the data graphically using boxplots:

> mA<-labdat[(labdat[,1]=="A"),2]


> mB<-labdat[(labdat[,1]=="B"),2]

> mC<-labdat[(labdat[,1]=="C"),2]

> mD<-labdat[(labdat[,1]=="D"),2]

> boxplot(mA,mB,mC,mD,names=c("A","B","C","D"))

Figure 2.1: Boxplots of measurements at different laboratories.

Now the analysis of variance:

> lab.aov<-aov(weight~Lab,data=labdat)

> summary(lab.aov)

Df Sum Sq Mean Sq F value Pr(>F)

Lab 3 0.013006 0.004335 2.8097 0.05038

Residuals 44 0.067892 0.001543

The conclusion may be that there is no significant difference in the means of the measurements for the various laboratories if α is taken to be 0.05.

We applied the F-test for the k-sample problem in this situation. It is instructive to look at this F-statistic from another point of view. We first state a model equation for the measurements in this experiment. Measurement yij (the j-th measurement for laboratory i) is considered to be a realized value of the random variable Yij, where

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \qquad (2.1)$$

Here µ stands for the expected weight of the metal coating if one of the laboratories is randomly chosen. The parameters αi are the so-called laboratory effects (for i = 1, 2, 3, 4), which sum to zero (α1 + α2 + α3 + α4 = 0), and εij is a random variable with expectation zero describing the random variation among the various measurements. These random variables are assumed to be normally distributed. In fact, (2.1) can be viewed as a decomposition of the random variable Yij in terms of an overall expectation, a treatment effect and a random (normally distributed) noise term. Analogously, one can make a decomposition of the measured data yij

$$y_{ij} = \bar{y}_{\cdot\cdot} + (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}) + (y_{ij} - \bar{y}_{i\cdot}) = \hat{\mu} + \hat{\alpha}_i + e_{ij} \qquad (2.2)$$

in terms of the grand mean, the estimated treatment effects and residuals. Let us take a look at this decomposition in a small-sample problem (for the laboratory data the same structure is present, but the decomposition would take too much space), namely the teaching method data of section 1.3:

Method A  73 77 67 71
Method B  91 81 87 85
Method C  72 77 76 79

Writing decomposition (2.2) for the whole data matrix (having yij in row j, column i), we get

$$\begin{pmatrix} 73 & 91 & 72\\ 77 & 81 & 77\\ 67 & 87 & 76\\ 71 & 85 & 79 \end{pmatrix} = \begin{pmatrix} 78 & 78 & 78\\ 78 & 78 & 78\\ 78 & 78 & 78\\ 78 & 78 & 78 \end{pmatrix} + \begin{pmatrix} -6 & 8 & -2\\ -6 & 8 & -2\\ -6 & 8 & -2\\ -6 & 8 & -2 \end{pmatrix} + \begin{pmatrix} 1 & 5 & -4\\ 5 & -5 & 1\\ -5 & 1 & 0\\ -1 & -1 & 3 \end{pmatrix}$$

Note that in the matrix at the left hand side, n = 3 × 4 = 12 numbers have to be entered to determine the matrix completely. In the first matrix at the right hand side, there is exactly one choice to be made to completely determine the matrix (since all entries in the matrix are the same). For the second matrix at the right hand side, there are the constraints that each column contains only one value and that these values add up to zero. This means that, in order to fill this matrix, one has 3 − 1 = 2 degrees of freedom: the values for two columns can be fixed, the value for the third column is then implicitly chosen. Finally, the last matrix at the right hand side has the remaining 12 − 1 − 2 = 9 degrees of freedom; this matrix has the structure that the columns should sum to zero. Note that we encountered these degrees of freedom already in section 1.3 when the F statistic was considered. The F-statistic encountered there can be seen as the ratio of two sums of squares, each divided by its associated degrees of freedom. In this case, this corresponds to the sum of the squared elements of the second matrix at the right hand side (divided by 2) and the sum of the squared elements of the last matrix at the right hand side (divided by 9). Indeed,

$$SS_{\mathrm{total}} = SS_{\mathrm{treat}} + SS_{\mathrm{residual}}$$

with

$$SS_{\mathrm{total}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{\cdot\cdot})^2, \quad SS_{\mathrm{treat}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{k} n_i(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$$

and

$$SS_{\mathrm{res}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i\cdot})^2.$$
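This decomposition can be verified numerically for the teaching method data, reusing the objects ma, mb, mc, ss1 and ss2 from section 1.3 (the name yall is ours):

> yall<-c(ma,mb,mc)
> sum((yall-mean(yall))^2)   # SS_total
[1] 546
> ss1+ss2                    # SS_treat + SS_residual
[1] 546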

2.3 Randomized block design

Suppose there is some obvious difference between the chips in one day production. Say there are three different production lines and the weight of the metal coating might differ a bit from production line to production line. Suppose the chips were allocated randomly to the various laboratories and it turned out that laboratory A got 9 chips from production line 1, whereas the other laboratories each got only one chip from that production line. If it happened to be the case that the expected weight of the metal coating of chips from production line 1 were smaller than that of the others, whereas the laboratories were identical, the conclusion could be that laboratory A has a downward measurement bias, whereas this effect is in fact entirely due to the difference in production lines. One way of dealing with heterogeneity of the experimental units is to perform analysis of variance with blocking. In fact the idea is quite straightforward. We illustrate it using another example.

Example It is of interest to investigate whether four different forms of a standardized reading test are in fact equivalent. To this end, a total of twenty students is randomly selected from five different schools (four from each) to do these tests. The heterogeneity in the experimental units (students) that is present since they are from different schools is removed by blocking. Within a block (the four students from school j), the treatments (type of form) should be randomly assigned. This can be done via the following R code:


> stud<-1:20

> school<-rep(1:5,each=4)

> test<-sample(1:4)

> for (i in 2:5) test<-c(test,sample(1:4))

The data obtained using this design are

        School 1  School 2  School 3  School 4  School 5
Form 1     75        73        59        69        84
Form 2     83        72        56        70        92
Form 3     86        61        53        72        88
Form 4     73        67        62        79        95

Figure 2.2 visualizes these data, focusing on differences in achievement over the different forms and differences in achievement between different schools.

Figure 2.2: Visualization of achievements using 4 different forms and for 5 different schools.

Suppose in the randomized block design there are a treatments and b blocks. Then the following model is assumed for the measured quantity Yij corresponding to the subject receiving treatment i in block j:

$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij} \qquad (2.3)$$

for 1 ≤ i ≤ a and 1 ≤ j ≤ b. Here µ is an overall expectation, αi is the treatment effect at level i (these sum up to zero) and βj is the block effect of block j (these sum up to zero). It is the treatment effects αi which are of primary interest in this investigation. The data decomposition related to model (2.3) is given by

$$y_{ij} = \bar{y}_{\cdot\cdot} + (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}) + (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot}) + (y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot}) = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + e_{ij}$$

Here the numbers α̂i, β̂j and eij are called the estimated treatment effects, estimated block effects and residuals respectively.

Remember there are a levels of the treatment and b blocks. Analogously to the line of thought in the completely randomized design setting, we can interpret degrees of freedom in this situation. The matrix of measurements contains ab numbers, with overall mean ȳ··. Hence, the array containing the numbers yij − ȳ·· has ab − 1 degrees of freedom. The arrays containing the estimated treatment and block effects have a − 1 and b − 1 degrees of freedom respectively. The residual array has (a − 1)(b − 1) degrees of freedom.

The data decomposition results in a sum of squares decomposition

$$SS_{\mathrm{total}} = SS_{\mathrm{treat}} + SS_{\mathrm{block}} + SS_{\mathrm{residual}}$$

with

$$SS_{\mathrm{total}} = \sum_{i=1}^{a}\sum_{j=1}^{b}(y_{ij} - \bar{y}_{\cdot\cdot})^2, \quad SS_{\mathrm{treat}} = b\sum_{i=1}^{a}(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2, \quad SS_{\mathrm{block}} = a\sum_{j=1}^{b}(\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2$$

and

$$SS_{\mathrm{res}} = \sum_{i=1}^{a}\sum_{j=1}^{b}(y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot})^2.$$

The degrees of freedom of these sums of squares are indeed ab − 1, a − 1, b − 1 and (a − 1)(b − 1) respectively.

The result of the Analysis of Variance can be summarized in an ANOVA table of the following structure:

        SS     df              MS                      F            pF
Block   SSbl   b − 1           SSbl/(b − 1)            MSbl/MSres   -
Treat   SStr   a − 1           SStr/(a − 1)            MStr/MSres   -
Resid   SSres  (a − 1)(b − 1)  SSres/((a − 1)(b − 1))
Total   SStot  ab − 1


The first p-value in the last column is only of secondary interest: it reflects the necessity of blocking in hindsight. The second gives the p-value for the null hypothesis of absence of treatment effects (α1 = · · · = αa = 0).

Example (revisited) We first construct a data frame fdat with components achiev, form and school. Based on the vectors achiev, form and school as they are given in the data table, this data frame can be constructed via

> fdat<-data.frame(achiev=achiev,form=as.factor(form),
+ school=as.factor(school))

Then, the ANOVA table is obtained by

> ach.aov<-aov(achiev~school+form,data=fdat)
> summary(ach.aov)

Df Sum Sq Mean Sq F value Pr(>F)

school 4 2326.70 581.68 20.5721 2.65e-05 ***

form 3 42.95 14.32 0.5063 0.6852

Residuals 12 339.30 28.28

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here the conclusion is that the null hypothesis (there is no form effect) cannot be rejected at the usual levels of significance. There is, however, a major block effect. Again take a look at Figure 2.2 to see whether this conclusion seems reasonable.

Based on the object ach.aov, estimates of the parameters (effects) in the model can be extracted using the R function model.tables. In this case this results in

> model.tables(ach.aov)

Tables of effects

form

1 2 3 4

-1.45 1.15 -1.45 1.75

school

1 2 3 4 5

5.80 -5.20 -15.95 -0.95 16.30

Note that the estimated form effects do not differ significantly from zero. The estimated block effects suggest that students from the third school performed rather poorly, whereas students from the fifth school performed relatively well. See also Figure 2.2.
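These effects can also be recomputed by hand; a small sketch, assuming the vectors achiev and school used to build fdat are still available (output not shown):

> # block (school) effects: school means minus the grand mean
> tapply(achiev,school,mean)-mean(achiev)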

2.4 Factorial treatment structure

In the completely randomized and randomized block designs considered in the previous sections, the treatments consisted of one factor that had various levels. It is also possible that the effects of more than one factor are to be studied simultaneously.

Example. There are various choices a manager can make that may affect the daily sales of a certain canned food product. At a certain market store, management wishes to investigate the effect of choices such as shelf position (low, medium or high) and package size (10, 12, 16 or 24 oz.) on the daily sales of the product. An ‘experimental unit’ in this situation is a specific day. For such a day, a choice of ‘treatment’ is made. There is a total of 12 different treatments: each combination of shelf position (3 levels) and package size (4 levels) corresponds to one treatment. Management decided to assign each treatment to six days (experimenting thus for 72 days), and only the days Tuesday, Wednesday and Thursday are available for experimentation. These days are considered to be equivalent. If this were not explicitly stated, one could have blocked for this factor and constructed three blocks, corresponding to the different types of days. Also some balancing in time would be natural.

Now the following approach was adopted. First the various treatments were systematically listed and then randomly assigned to the available days as follows:

> shelf<-rep(1:3,each=24)

> size<-rep(1:4,times=18)

> des<-cbind(shelf,size)

> des<-des[sample(1:72),]

> des<-cbind(1:72,des)

According to the scheme thus obtained, the product was put on the shelf and at the end of the day the sales (in dollars) were determined. The table below shows part of the data obtained.


shelf  size  sales
Low    10    70.10
Low    12    72.25
...    ...   ...
High   24    70.30

Figure 2.3 shows boxplots of the sales data, split according to the various levels of the two factors.

Figure 2.3: Boxplots of subsets of the sales data, chosen according to the levels of the factors shelf and size.

In this example, one could use the analysis of variance for the completely randomized design. Then one would consider the treatments as one factor with 12 levels. However, one can also take into account the so-called factorial structure in the treatments. The model decomposition for the observations is

$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk} \qquad (2.4)$$

Here µ is again the overall expectation, αi is the main effect of the first factor at level i (these effects sum to zero), βj is the main effect of the second factor at level j and γij is the interaction effect of the two factors at levels i and j. If the interaction effects are all zero, the model is called additive.

The associated data decomposition is given by

$$y_{ijk} = \bar{y}_{\cdot\cdot\cdot} + (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}) + (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot}) + (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}) + (y_{ijk} - \bar{y}_{ij\cdot}) = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \hat{\gamma}_{ij} + e_{ijk}$$

The hats on top of the quantities at the right hand side indicate that these are estimates of the corresponding parameters in (2.4). Based on this decomposition, one can decompose the total sum of squares into various sums of squares in the same fashion as was done in the previous sections. These sums of squares enter the ANOVA table for this problem.

          SS     df              MS                      F            pF
Fact1     SSf1   a − 1           SSf1/(a − 1)            MSf1/MSres   -
Fact2     SSf2   b − 1           SSf2/(b − 1)            MSf2/MSres   -
Interact  SSint  (a − 1)(b − 1)  SSint/((a − 1)(b − 1))  MSint/MSres  -
Resid     SSres  ab(r − 1)       SSres/(ab(r − 1))
Total     SStot  abr − 1

The p-values in the final column are associated with the tests that the main effects of the two factors are zero (first two lines) or that the interaction effects are zero.

Example (revisited) In R this ANOVA table can be obtained as follows:

> sd.aov<-aov(sales~shelf*size,data=sd)
> summary(sd.aov)

Df Sum Sq Mean Sq F value Pr(>F)

shelf 2 8630.5 4315.2 198.209 < 2.2e-16 ***

size 3 13785.8 4595.3 211.071 < 2.2e-16 ***

shelf:size 6 2810.7 468.4 21.517 2.634e-13 ***

Residuals 60 1306.3 21.8

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

These results show that the main effects of the two factors shelf and size are nonzero. Moreover, the interaction terms also turn out to be nonzero (in other words, the appropriate model is not additive). The estimated effects can be obtained via

> model.tables(sd.aov)

Tables of effects

shelf


1 2 3

-14.96 4.03 10.93

rep 24.00 24.00 24.00

size

1 2 3 4

-0.6493 9.12 13.74 -22.21

rep 18.0000 18.00 18.00 18.00

shelf:size

size

shelf 1 2 3 4

1 -1.122 -5.858 -7.463 14.442

rep 6.000 6.000 6.000 6.000

2 2.928 2.267 1.920 -7.116

rep 6.000 6.000 6.000 6.000

3 -1.807 3.590 5.543 -7.326

rep 6.000 6.000 6.000 6.000

The main-effect estimates can be related to the boxplots in Figure 2.3. The interaction terms can be visualized using the interaction plot in Figure 2.4. If the interaction terms were close to zero (i.e. if the model were nearly additive), the interaction plot would look like four vertically shifted copies of one polygon. From the interaction plot (and also from the table showing the estimated interaction effects) it follows that level 4 of the size factor behaves rather strangely in relation to the other levels. Therefore, we also study a model for the sales where the number of levels of the size factor is restricted to three: sales corresponding to the largest size are removed from the data.

Figure 2.4: Interaction plot, showing the means of the various sales, restricted to particular factor level combinations.
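The code that produced Figure 2.4 is not shown in these notes; a call along the following lines, using the base R function interaction.plot, produces a comparable plot (the axis labels in the figure suggest exactly these arguments):

> interaction.plot(sd$shelf,sd$size,sd$sales)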

> sdmin<-sd[(sd$size!=4),]

> sdmin.aov<-aov(sales~shelf*size,data=sdmin)

> summary(sdmin.aov)

Df Sum Sq Mean Sq F value Pr(>F)

shelf 2 10996.9 5498.4 223.2812 < 2.2e-16 ***

size 2 1943.6 971.8 39.4622 1.263e-10 ***

shelf:size 4 307.5 76.9 3.1221 0.02381 *

Residuals 45 1108.2 24.6

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It is clear that the interactions are considerably less significant than in the analysis where all four sizes were studied. The estimates of the various effects are given by

> model.tables(sdmin.aov)

Tables of effects

shelf

1 2 3

-19.78 6.402 13.37

rep 18.00 18.000 18.00

size

1 2 3

-8.054 1.716 6.338

rep 18.000 18.000 18.000

shelf:size

size

shelf 1 2 3

1 3.693 -1.044 -2.649

rep 6.000 6.000 6.000


2 0.556 -0.105 -0.452

rep 6.000 6.000 6.000

3 -4.249 1.148 3.101

rep 6.000 6.000 6.000

2.5 Analysis of Covariance

The randomized block design can be used if there is a nuisance factor that cannot be held constant. The subjects are then divided into homogeneous groups and within those groups (blocks) the idea of randomization is used to assign the treatments. In many situations, one or more quantitative covariates corresponding to the subjects can be measured before or after the experiment is conducted. One way to include this heterogeneity would be to divide the subjects into more or less homogeneous groups and use these as blocks. In this section we discuss an alternative method to take care of these covariates: they are not used to block the subjects, but are corrected for by extending the ANOVA model to a so-called ANCOVA model (Analysis of Covariance).

Example A research worker has three types of cleaner, C1, C2 and C3. He wishes to select the most efficient product for cleaning a metallic surface. The cleanliness of a surface is measured by a quantity called reflectivity. This is expressed in arbitrary units as the ratio of the reflectivity observed to that of a standard mirror surface. Now an experiment is conducted using 12 pieces of metal, the experimental units. These experimental units have the following initial reflectivity (so before the cleaner is used):

0.50 0.55 0.60 0.35 0.75 1.65
1.00 1.10 0.60 0.90 0.80 0.70

Now a completely randomized design is applied, and a cleaner is assigned to each experimental unit. Each cleaner is applied to four pieces of metal. The following reflectivity measurements are obtained. The notation 0.50 → 1.00 means that the associated experimental unit (piece of metal) had initial reflectivity 0.50 and after treatment it had reflectivity 1.00.


C1            C2            C3
0.50 → 1.00   0.75 → 0.75   0.60 → 1.00
0.55 → 1.20   1.65 → 0.60   0.90 → 0.70
0.60 → 0.80   1.00 → 0.55   0.80 → 0.80
0.35 → 1.40   1.10 → 0.50   0.70 → 0.90

Using the code below, we get Figure 2.5.

> x<-c(0.50,0.55,0.60,0.35,0.75,

+ 1.65,1.00,1.10,0.60,0.90,0.80,0.70)

> y<-c(1.00,0.75,1.00,1.20,0.60,

+ 0.70,0.80,0.55,0.80,1.40,0.50,0.90)

> cl<-as.factor(rep(1:3,each=4))

> plot(x,y,type="n",xlab="",ylab="")

> text(x,y,cl)

Figure 2.5: Visualization of the data, plotting the reflectivity after treatment against the reflectivity before the treatment, using the number of the cleaner as plotting symbol.

The general model for the data, where there are a different treatments and each treatment is applied to r units, is given by

$$Y_{ij} = \mu + \alpha_i + \delta x_{ij} + \varepsilon_{ij} \qquad (2.5)$$


for 1 ≤ i ≤ a and 1 ≤ j ≤ r. Here xij is the measured covariate for the j-th unit having received treatment i. Also in this model, the result of the test on the absence of effects can be summarized in an ANOVA table. The principle is again the same: a data decomposition based on (2.5) results in a decomposition of the total sum of squares into sums of squares related to the treatment, to the covariate and a residual sum of squares. We will not make these concepts more explicit, but will illustrate the procedure and the interpretation of the results using the example with the different types of cleaner.

Example (revisited) In order to execute the analysis of covariance, we first again have to define an appropriate data frame in R and load the model formula into the function aov.

> clean<-data.frame(y=y,cl=cl,x=x)

> clean.aov<-aov(y~x+cl,data=clean)

> summary(clean.aov)

Df Sum Sq Mean Sq F value Pr(>F)

x 1 0.10037 0.10037 1.4530 0.2625

cl 2 0.13199 0.06600 0.9553 0.4246

Residuals 8 0.55264 0.06908

Here we see that the estimated treatment effects do not differ significantly from zero; the p-value of 0.4246 indicates this.
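The fitted coefficients, including the estimate of the covariate slope δ in (2.5), can be inspected as well; a minimal sketch (output not shown):

> coef(clean.aov)   # coefficient estimates of the underlying linear model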


Further reading

In this final section some references are given that can be used for further study. The material on normal distributions can be read in Statistical Reasoning for Everyday Life, the text used in the bachelor course Toegepaste Statistiek for students of informatiekunde. A text with an accessible description of the material covered in these notes (and more) is Statistical Methods in Agriculture and Experimental Biology. Also other, somewhat more mathematically flavored textbooks can be consulted, e.g. Miller & Freund's Probability and Statistics for Engineers.

Various extensions are possible. For example, it can happen that the response variable is multivariate. In that case, effects of treatments can be studied using multivariate analysis of variance (MANOVA). In other situations, the normality of the response variable is questionable (or the data are even measured on a non-quantitative scale). Then one could use models that are based on ranks rather than on the original data. The two-sample Wilcoxon test and the Kruskal-Wallis test could be used as alternatives for the two-sample t-test and the k-sample F-test. More on that approach can be found in Nonparametrics: Statistical Methods Based on Ranks.

- Bennett, Briggs & Triola (2003). Statistical Reasoning for Everyday Life. Second edition. Addison Wesley.

- Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Wiley.

- Mead, R., Curnow, R.N., and Hasted, A.M. (1993). Statistical Methods in Agriculture and Experimental Biology. Chapman & Hall.

- Johnson, R.A. (2005). Miller & Freund's Probability and Statistics for Engineers. Pearson Education.
