
2 Linear models

2.1 Linear regression (Part A revision)

2.1.1 The modelling assumptions

You have already seen that a linear model can be written in the form

Y = Xθ + ε

where X is an n × p matrix of known constants, θ is a vector of p unknown parameters and ε is a vector of departures or residuals with distribution N(0, σ²I). Remember that, in order to maximise the likelihood, we need to minimise

∑_{i=1}^n εi² = ε^T ε = (y − Xθ)^T (y − Xθ) = ‖y − Xθ‖²,

and we showed that ‖y − Xθ‖ achieves its minimum when the so-called normal equations

X^T X θ̂ = X^T Y    (1)

are satisfied by the estimator θ̂. You should remember that we did this by letting VX be the vector subspace spanned by the columns of X, and letting rank(X) = r ≤ p so that dim(VX) = r. Then

E(Y) = Xθ ∈ VX.

Figure 2.1 Orthogonal projection of y onto VX: the distance ‖y − Xθ̂‖ is minimised at the projection Xθ̂

Clearly ‖y − Xθ‖ achieves its minimum over θ ∈ R^p when Xθ is the orthogonal projection of y onto VX. Thus

X^T (y − Xθ) = 0,

and Equation 1 follows.

Note that, before you can go ahead and estimate these coefficients, you need to be extra careful about validating the assumptions which underlie the method of fitting used. These are:


• the variance of the response is constant;
• the distribution of the response is normal;
• the model you are trying to fit is a valid one.

You also need to remember that, when X^T X is non-singular, the estimators have useful properties.

Property (i) θ̂ is unbiased for θ; that is, E[θ̂] = θ. ¤

Property (ii) The covariance matrix of θ̂ is Σθ = σ²(X^T X)^{-1}. ¤

Property (iii) θ̂ = (X^T X)^{-1} X^T Y is a linear transformation of the normally distributed random vector Y, so that

θ̂ ∼ N(θ, σ²(X^T X)^{-1}). ¤

Example 2.1 (Revision example) Abrasion loss in rubber

The data set rubber.txt is given below in Table 2.1, and is taken from an experiment on tyre wear (Davies, 1972). Each row contains three observations on a sample, these being abrasion loss (grams/hour), hardness (degrees Shore) and tensile strength (kgf/cm²).

Table 2.1 Abrasion loss in rubber

Loss (Y)  Hardness (x1)  Strength (x2)    Loss (Y)  Hardness (x1)  Strength (x2)
 372          45             162            196          68             173
 206          55             233            128          75             188
 175          61             232             97          83             161
 154          66             231             64          88             119
 136          71             231            249          59             161
 112          71             237            219          71             151
  55          81             224            186          80             165
  45          86             219            155          82             151
 221          53             203            114          89             128
 166          60             189            341          51             161
 164          64             210            340          59             146
 113          68             210            283          65             148
  82          79             196            267          74             144
  32          81             180            215          81             134
 228          56             200            148          86             127


Suppose we try to fit a linear model of the form

Yi = α+ β1xi1 + β2xi2 + εi

using the computer. First we read in the data and use attach to produce vectors with the column names.

> rubber <- read.table("e:/Lecture_OBS1/rubber.txt", header = 1)
> attach(rubber)

Then we check that the model we want to fit looks valid by inspecting scatterplots.

> plot(Hardness,Loss)

You can see from Figure 2.2 that the dependence of Loss on Hardness is plausibly linear.

Figure 2.2 Plot of Loss against Hardness

> plot(Strength,Loss)


Figure 2.3 Plot of Loss against Strength

This is much less convincing and there may well be no obvious relationship at all.

Suppose, after looking at these plots, we fit the model

Yi = α+ β1xi1 + εi.

> rubh <- lm(Loss ~ Hardness)
> summary(rubh)

This produces the following output.

Call:
lm(formula = Loss ~ Hardness)

Residuals:
    Min      1Q  Median      3Q     Max
 -86.15  -46.77  -19.49   54.27  111.49

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  550.4151    65.7867   8.367 4.22e-09 ***
Hardness      -5.3366     0.9229  -5.782 3.29e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 60.52 on 28 degrees of freedom
Multiple R-Squared: 0.5442, Adjusted R-squared: 0.5279
F-statistic: 33.43 on 1 and 28 DF, p-value: 3.294e-06

Look at the coefficient for Hardness, that is β1, which is significantly different from zero. Most of the information given in the output is familiar from Part A. R-squared is defined as (more on this definition later; for the moment you can think of it as a measure of the strength of the relationship)

R² = (sum of squares explained by the model) / (total sum of squares).


Adjusted R-squared, which is an R-squared value adjusted for degrees of freedom, is a measure we shall not use, and you can safely ignore it. The residual standard error, given here as 60.52 on 28 degrees of freedom, is calculated from

s = √[ (y − Xθ̂)^T (y − Xθ̂) / (n − r) ],

so that for this example n − r = 28. You derived this estimate when you did Part A.
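As a quick check on these definitions, here is a minimal R sketch (an aside, not part of the original notes; it assumes the fit rubh and the Loss vector from above are still in the workspace).

rss <- sum(resid(rubh)^2)             # (y - X theta-hat)'(y - X theta-hat)
tss <- sum((Loss - mean(Loss))^2)     # total sum of squares
sqrt(rss / df.residual(rubh))         # residual standard error on n - r = 28 df, about 60.5
1 - rss / tss                         # R-squared, about 0.544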

But it was unclear from the plots whether we should also regress on Strength. The easiest way to do this is to plot the residuals from regression on Hardness against Strength.

> reshard <- resid(rubh)
> plot(Strength,reshard)

Figure 2.4 Plot of residuals against Strength

It looks as if there is indeed a relationship which is plausibly a straight line.

This method provides a useful way of assessing whether or not it is worth adding a variable. In general, you can plot the residuals from a multiple regression on any number of variables against another possible explanatory variable to get some idea of whether to incorporate it and whether the relationship might be linear. Such plots are called added variable plots.

Regression on both Hardness and Strength gives the following output.

> rubreg <- lm(Loss ~ Hardness + Strength)
> summary(rubreg)

Call:
lm(formula = Loss ~ Hardness + Strength)

Residuals:
     Min       1Q   Median       3Q      Max
 -79.385  -14.608    3.816   19.755   65.981

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 885.1611    61.7516  14.334 3.84e-14 ***
Hardness     -6.5708     0.5832 -11.267 1.03e-11 ***
Strength     -1.3743     0.1943  -7.073 1.32e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 36.49 on 27 degrees of freedom
Multiple R-Squared: 0.8402, Adjusted R-squared: 0.8284
F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11

Note the improvement in R-squared to 84%. Of course, we still need to check the assumptions. Recall how you did this in Part A. The constant variance assumption can be checked from a plot of residuals against fitted values and the normality assumption from a normal Q-Q plot.

> resrub <- resid(rubreg)
> fitrub <- fitted(rubreg)
> plot(fitrub, resrub, xlab = "Fitted value", ylab = "Residual")
> qqnorm(resrub, ylab = "Residuals")

Figure 2.5: Diagnostic plots for rubber (residuals against fitted values, and normal Q−Q plot of the residuals)

The plot of Residuals against Fitted values shows a reasonably even scatter and gives no reason to doubt the constant variance assumption. The normal Q-Q plot is plausibly linear and gives no reason to doubt the normality assumption. From Figures 2.2 and 2.4 the model we have fitted is a valid one. ¤


2.1.2 Linearity

You will recall that there are two kinds of linearity involved in fitting models such as the one in Example 2.1; there is linearity in the explanatory variables and linearity in the parameters. We emphasised the fact that the term linear when applied to statistical models means linear in the parameters. You have already seen that non-linearity in the variables can be dealt with by treating terms such as x² as new additional explanatory variables which are functions of the old ones and then fitting a multiple regression model which is linear in (x, x²). This can, of course, be taken further. Thus

E(Y | x) = α + β1 x + β2 x² + · · · + βp x^p,

E(Y | x1, x2) = α + β1 x1 + β2 x2 + γ x1 x2,

E(Y | x) = α + β e^x,

are all linear in the parameters, but

E(Y | x) = α + β e^{γx}

is not linear in (α, β, γ).

Example 2.2 Classify the models below as linear or non-linear.

(i) E(Y | x1, x2) = α + β1 x1 + β2 x1² + γ x1 x2;

(ii) E (Y | x1, x2) = α+ β1 log x1 + β2 log x2;

(iii) E(Y | x) = α + β(x + γ)^{-1};

(iv) E (Y | x) = α+ β log (x+ γ) .

(i) and (ii) are linear, (iii) and (iv) are not. ¤

2.2 Model selection

2.2.1 Added variable plots

You saw this useful technique in Example 2.1. Let us try to underline its real value in modelling by looking at a familiar example from Part A; namely the data on Olympic sprint times. The data file sprints.txt contains the times for the Olympic 100m, 200m, 400m, 800m, 1500m for the years 1900 to 2000 inclusive, excluding the Olympics which were not held during the years of the first and second world wars. You will recall that we decided to omit the 100m on the grounds that the event is too dependent upon the start to be modelled in the same way as the others. We then converted each time to an average speed by dividing it into the race distance, and then used the variables Speed, Distance, Year and Altitude (given in feet above sea level),

Speed = α + β1 Distance + β2 Distance² + β3 Year + β4 log(Altitude).

Note that, in order to avoid very large numbers, the Distance variable is given in units of 100 metres rather than in metres. For convenience, these data are given in the file sprintspeed.txt.

Example 2.3 Olympic sprint speeds

When we looked at this piece of modelling in Part A, you were happy to take my word for it that one of the explanatory variables should be log(Altitude). How might we have arrived at this?

Let us suppose that we have fitted a model with Speed regressed on Distance, Distance² (remember that the original distances have been divided by 100) and Year. Now it is quite clear that Altitude ought to have some effect and, indeed, if we simply add it as a variable we obtain the output given below.

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.866e+00  7.755e-01 -12.723  < 2e-16 ***
Year         1.054e-02  3.971e-04  26.552  < 2e-16 ***
Distance    -5.893e-01  1.234e-02 -47.748  < 2e-16 ***
Distance2    2.120e-02  6.952e-04  30.449  < 2e-16 ***
Altitude     3.124e-05  8.160e-06   3.829 0.000242 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1159 on 87 degrees of freedom
Multiple R-Squared: 0.9913, Adjusted R-squared: 0.9909
F-statistic: 2480 on 4 and 87 DF, p-value: < 2.2e-16

Now everything looks fine and the model seems pretty good; there is even an R-squared of 0.9913, and you might think that was the end of the story. But suppose you were to look at an added variable plot. First we regress Speed on Distance, Distance² and Year, and then we plot the residuals against Altitude.

> fastr <- resid(lm(Speed ~ Year + Distance + Distance2))
> plot(Altitude,fastr)


Figure 2.6 Plot of residuals against Altitude

Clearly all is not well. Most of the altitudes are low and the four high ones are having too great an influence. It is necessary to spread them out a bit at the low end and pull the four high ones in, so how about trying the square root of Altitude?

> plot(sqrt(Altitude),fastr)

Figure 2.7 Plot of residuals against √Altitude

This is a bit better, but the four points to the right are still too influential and there is still too much bunching at low values of Altitude: there also seems to be some curvature. Try taking the log of Altitude.

> plot(log(Altitude),fastr)


Figure 2.8 Plot of residuals against log(Altitude)

Now you can see why we chose log(Altitude) as an explanatory variable. The final model gave the output below.

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -9.6887689  0.7276597 -13.315  < 2e-16 ***
Year           0.0104018  0.0003744  27.784  < 2e-16 ***
Distance      -0.5893413  0.0115732 -50.923  < 2e-16 ***
Distance2      0.0212043  0.0006519  32.527  < 2e-16 ***
log(Altitude)  0.0285146  0.0053292   5.351 7.01e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1087 on 87 degrees of freedom
Multiple R-Squared: 0.9924, Adjusted R-squared: 0.992
F-statistic: 2823 on 4 and 87 DF, p-value: < 2.2e-16

We shall now look at techniques for assessing the adequacy of models and criteria for choosing between alternative ones.

2.2.2 Some notational conventions

Let us begin by defining some notational conventions that are commonly used. Firstly, recall that

RSS = (y − Xθ̂)^T (y − Xθ̂) = y^T y − y^T X θ̂.

But from Equation 1,

X^T X θ̂ = X^T y, or equivalently y^T X = θ̂^T X^T X,


so that

RSS = y^T y − θ̂^T X^T X θ̂,    (2)

which is not only pleasingly symmetric but will turn out to be useful.

Now recall that we can write our linear model in the form

yi = α + ∑_{j=1}^p βj xij,   i = 1, . . . , n,

or equivalently

yi = αp + ∑_{j=1}^p βj (xij − x̄.j),   i = 1, . . . , n,

where

x̄.j = (1/n) ∑_{i=1}^n xij   and   αp = α + ∑_{j=1}^p βj x̄.j.

Now, writing θ̂ = (α̂p, β̂)^T, where β̂ = (β̂1, . . . , β̂p)^T,

and

X^T X = ( n      0^T
          0   Xp^T Xp ),

where Xp is the matrix of centred explanatory variables,

Xp = ( x11 − x̄.1   · · ·   x1p − x̄.p
           ⋮         ⋱          ⋮
       xn1 − x̄.1   · · ·   xnp − x̄.p ).

Equation 2 can now be written

RSS = y^T y − n ȳ² − β̂^T Xp^T Xp β̂.

The following shorthand is introduced. We shall write

RSS = (y^T y − n ȳ²) − β̂^T Xp^T Xp β̂
RSS = TSS − ESS(p),

where RSS stands for the residual sum of squares, TSS for the total sum of squares and ESS(p) for the explained sum of squares (i.e. the sum of squares explained by regression on p explanatory variables).


2.2.3 Testing for no linear relationship

Suppose the individual t-tests of coefficients in a model do not result in rejection of the null hypotheses that each is zero. Does this mean there is no linear relationship? The answer is NO.

It is important to realise that a series of tests of the null hypotheses

H0 : βi = 0, βj unspecified for j ≠ i

is not the same as

H0 : β1 = β2 = · · · = βp = 0,

which is the null hypothesis of no linear relationship.

The likelihood is

L(θ, σ²; y) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) (y − Xθ)^T (y − Xθ) },

where θ^T = (α, β1, . . . , βp).

We already know the m.l.e. θ̂ of θ. The m.l.e. of σ² is found from the log-likelihood

ℓ(θ, σ²; y) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) (y − Xθ)^T (y − Xθ),

so that

∂ℓ/∂σ² = −n/(2σ²) + (1/(2σ⁴)) (y − Xθ)^T (y − Xθ),

which is zero when

σ² = (1/n) (y − Xθ)^T (y − Xθ).

The invariance theorem for likelihood gives

σ̂² = (1/n) (Y − Xθ̂)^T (Y − Xθ̂) = Q²/n.

Under the general alternative, therefore,

sup{ L(θ, σ²; y) : (θ, σ²) ∈ Θ } = (2πq²/n)^{−n/2} e^{−n/2}.

If, under H0,

Q0² = (Y − Xθ̂0)^T (Y − Xθ̂0),

then

sup{ L(θ, σ²; y) : (θ, σ²) ∈ Θ0 } = (2πq0²/n)^{−n/2} e^{−n/2}


and the likelihood ratio statistic is

λ(y) = (q0²/q²)^{−n/2}.

Writing (you will see why shortly)

λ(y) = ( 1 + (q0² − q²)/q² )^{−n/2}

and recalling that the critical region has the form

C1 = {y : λ(y) ≤ k},

then, since

(q0² − q²)/q² = λ(y)^{−2/n} − 1

is a decreasing function of λ(y), the critical region is given by

C1 = { y : (q0² − q²)/q² ≥ c }.

In the context of testing for no linear relationship, H0 : β1 = β2 = · · · = βp = 0,

q0² = (y − Xθ̂0)^T (y − Xθ̂0) = y^T y − n ȳ²

and q² = (y − Xθ̂)^T (y − Xθ̂) = y^T y − n ȳ² − β̂^T Xp^T Xp β̂,

so

q0² − q² = ESS(p)

and

(q0² − q²)/q² = ESS(p)/RSS.

This is intuitively reasonable. The test for no linear relationship is based on the relative sizes of the sum of squares explained by regression and the residual sum of squares: the larger the ratio, the more likely it is that there is a linear relationship.

We already know that Xθ̂ is the orthogonal projection of y onto VX. A similar argument leads to Xθ̂0 being the orthogonal projection of y onto V0, the (r − k)-dimensional vector subspace of VX which contains Xθ̂0 when testing a null hypothesis that k of the βi are zero. This is shown in the figure.


Figure 2.9 Orthogonal projections Xθ̂ and Xθ̂0 of y onto VX and V0

Now we use a “double decker” version of the construction we used previously, defining {e1, . . . , er−k} as an orthonormal basis for V0, and extending it to {e1, . . . , er} for VX with further extension to {e1, . . . , en} for R^n. With independent normally distributed random variables Z1, Z2, . . . , Zn as before,

Y = ∑_{i=1}^{r−k} Zi ei + ∑_{i=r−k+1}^{r} Zi ei + ∑_{i=r+1}^{n} Zi ei.

We have already shown in Part A that

Q² = ∑_{i=r+1}^{n} Zi².

We now have also that

Xθ̂0 = ∑_{i=1}^{r−k} Zi ei

and

Y − Xθ̂0 = ∑_{i=r−k+1}^{n} Zi ei

so that

Q0² = ∑_{i=r−k+1}^{n} Zi².

Therefore

Q0² − Q² = ∑_{i=r−k+1}^{r} Zi²

which is independent of

Q² = ∑_{i=r+1}^{n} Zi².


Under H0,

(Q0² − Q²)/σ² ∼ χ²(k)

and

Q²/σ² ∼ χ²(n − r).

Definition 2.1: The Fisher distribution (or F -distribution)

If X1 and X2 are independent χ² random variables with ν1 and ν2 degrees of freedom respectively, then the random variable

(X1/ν1) / (X2/ν2) ∼ F(ν1, ν2).

¤

This definition results in

[(n − r)/k] · (Q0² − Q²)/Q² ∼ F(k, n − r).

This result can now be used for testing H0 : β1 = β2 = · · · = βp = 0; here r = 1 + p. It is usual to set out the calculation in a table called an analysis of variance table (or ANOVA for short).

Source of variation            df           SS    MS                Ratio
Regression on x1, . . . , xp   p            ESS   ESS/p             F(p, n − p − 1)
Residual                       n − p − 1    RSS   RSS/(n − p − 1)
Total                          n − 1        TSS

This is a traditional form of setting out the calculation which had its origins in the pre-computer days. Most statistical software provides ANOVA output as a standard option with the regression package, usually in the universally recognised layout shown above. In R you simply apply the function anova to the output of the lm function.

Example 2.4 (Example 2.1 revisited) Abrasion loss in rubber

> anova(rubreg)
Analysis of Variance Table

Response: Loss
          Df Sum Sq Mean Sq F value    Pr(>F)
Hardness   1 122455  122455  91.970 3.458e-10 ***
Strength   1  66607   66607  50.025 1.325e-07 ***
Residuals 27  35950    1331
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

¤

Example 2.5 (Example 2.3 revisited) Olympic sprint times

In Example 2.3, we fitted Speed against Year, Distance, Distance², and log(Altitude). Suppose we now exercise the option of having an ANOVA table produced as part of the output. We obtain the following results.

Analysis of Variance Table

Response: Speed
              Df  Sum Sq Mean Sq  F value    Pr(>F)
Year           1  10.295  10.295  872.055 < 2.2e-16 ***
Distance       1 110.207 110.207 9335.217 < 2.2e-16 ***
Distance2      1  12.490  12.490 1058.001 < 2.2e-16 ***
log(Altitude)  1   0.338   0.338   28.629 7.014e-07 ***
Residuals     87   1.027   0.012
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

According to the output, all of the variables are significant. But look at what happens when you fit Speed against Distance2, Distance, Year and log(Altitude), i.e. you change their order when you specify the model.

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -9.6887689  0.7276597 -13.315  < 2e-16 ***
Distance2      0.0212043  0.0006519  32.527  < 2e-16 ***
Distance      -0.5893413  0.0115732 -50.923  < 2e-16 ***
Year           0.0104018  0.0003744  27.784  < 2e-16 ***
log(Altitude)  0.0285146  0.0053292   5.351 7.01e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1087 on 87 degrees of freedom
Multiple R-Squared: 0.9924, Adjusted R-squared: 0.992
F-statistic: 2823 on 4 and 87 DF, p-value: < 2.2e-16

Compare this with Example 2.3 and you will see that the numbers are exactly the same: but what about the ANOVA table?

Analysis of Variance Table

Response: Speed
              Df Sum Sq Mean Sq  F value    Pr(>F)
Distance2      1 92.084  92.084 7800.055 < 2.2e-16 ***
Distance       1 30.614  30.614 2593.162 < 2.2e-16 ***
Year           1 10.295  10.295  872.055 < 2.2e-16 ***
log(Altitude)  1  0.338   0.338   28.629 7.014e-07 ***
Residuals     87  1.027   0.012
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The TSS and RSS are unchanged, but the contribution of each variable to the sum of squares explained by the model clearly depends upon the order in which each variable is included. How, then, are we to look at the effect upon the model of leaving out one or more variables? In other words, how can we tell how big a contribution a single variable or a group of variables is making? ¤

2.2.4 Finding subsets of explanatory variables

The answer to the problem of finding a suitable subset of explanatory variables lies in looking at the change to the RSS brought about by leaving some of them out. If the model is fitted on p variables, a subset of k variables is then taken out and the model re-fitted on the remaining p − k variables, the change in the RSS will be the sum of squares contributed by that subset. Furthermore, if RSSp and RSSp−k are the respective residual sums of squares for fitting p variables and p − k variables with a sample of size n,

[(RSSp−k − RSSp)/k] / [RSSp/(n − p − 1)] ∼ F(k, n − p − 1)

under the null hypothesis that the coefficients of the k variables are all zero.

Example 2.6 Olympic sprint times

Altitude makes the smallest contribution to the sum of squares so use the difference in RSS to test β4 = 0 in the model. Refitting without the log(Altitude) variable results in the ANOVA output below.


Analysis of Variance Table

Response: Speed
          Df Sum Sq Mean Sq F value    Pr(>F)
Distance2  1 92.084  92.084 5936.26 < 2.2e-16 ***
Distance   1 30.614  30.614 1973.54 < 2.2e-16 ***
Year       1 10.295  10.295  663.68 < 2.2e-16 ***
Residuals 88  1.365   0.016

Leaving out log(Altitude) has changed the RSS from 1.027 to 1.365, an increase of 0.338. We can write this in an ANOVA table.

Source of variation            df    SS        MS         Ratio
Regression on x1, . . . , x3    3    132.993
Extra from x4                   1    0.338     0.338      28.63
Residual                       87    1.027     1.027/87
Total                          91    134.358

Test against F(1, 87) to obtain a p-value of 0.000001. This confirms that we need to regress on all four explanatory variables. ¤
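As an aside (not part of the original notes), the same comparison can be made in R by fitting the two nested models and calling anova on them; the names fit3 and fit4 below are just illustrative, and the sprint variables are assumed to be attached as in Example 2.3.

fit3 <- lm(Speed ~ Year + Distance + Distance2)                   # without altitude
fit4 <- lm(Speed ~ Year + Distance + Distance2 + log(Altitude))   # with log(Altitude)
anova(fit3, fit4)    # reproduces F = 28.63 on (1, 87) df, p-value about 7e-07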

Example 2.7 Skeletal muscle in rats

The data are taken from a study of rat skeletal muscle by Dr M. Khan and Mrs M. Khan, and appear in Hand et al. (1994).

The data are counts of fibres in rat skeletal muscle. A group of fibres is called a fascicle and any fascicle contains fibres of two different types, these being Type I and Type II. Type I fibres are further subdivided into reticulated, punctate, or both reticulated and punctate.

We are interested in the relationship between the number of fibres of Type II and the numbers in the different subdivisions of Type I.

Model: y = α + β1 x1 + β2 x2 + β3 x3.


Table 2.2 Skeletal muscle in rats

Fascicle number   Reticulated (X1)   Punctate (X2)   Both (X3)   Type II fibres (Y)
       1                 1               13              5              15
       2                 2                8              4              12
       3                 9               27             16              46
       4                 4                5              2              12
       5                 2               12              7              24
       6                 2               31             16              66
       7                 1               13             15              45
       8                 8               27             16              50
       9                 1                5              5              18
      10                 1                8              5              15
      11                 1                2              2               4
      12                 1               11              3              17
      13                 1                8              6              15
      14                 2               17              5              30
      15                 1               14              4              17
      16                 1               11              3              16
      17                 2               12              2              19
      18                 1                8              7              14
      19                 1                8              4              14
      20                 0                4              3               7
      21                 1               18              5              26
      22                 0               11             10              26
      23                 2               15              7              24
      24                 0                0              4               5
      25                 0                4              3               6

Scatterplots of y against each of x1, x2, x3 are given below.


Figure 2.10 Scatterplots of rat muscle data (y against each of x1, x2, x3)

Initial thoughts:

• Possibly no relationship between y and x1.

• Linear relationships between y and x2, x3?

• No indication of systematic variation with either x2 or x3.

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
   Min     1Q Median     3Q    Max
-6.537 -2.508 -0.756  1.280  7.500

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.4135     1.5146  -0.933    0.361
x1           -0.8432     0.4925  -1.712    0.102
x2            1.1563     0.1787   6.471 2.06e-06 ***
x3            1.7525     0.2844   6.163 4.09e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.003 on 21 degrees of freedom
Multiple R-Squared: 0.94, Adjusted R-squared: 0.9314
F-statistic: 109.6 on 3 and 21 DF, p-value: 5.466e-13

Drop x1.

Call:
lm(formula = y ~ x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max
-7.7449 -2.2429 -0.6214  1.5926  8.0921

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.0452     1.5636  -0.668    0.511
x2            1.0407     0.1726   6.031 4.53e-06 ***
x3            1.6681     0.2921   5.711 9.60e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.175 on 22 degrees of freedom
Multiple R-Squared: 0.9316, Adjusted R-squared: 0.9254
F-statistic: 149.8 on 2 and 22 DF, p-value: 1.531e-13

All of this seems straightforward, but it is worth taking another look at the coefficients of x2 and x3. Note that they are not very different; could it be worth looking at a total punctate count and trying to fit the variable x2 + x3? First, produce the ANOVA table for this model.

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
x2         1 4655.0  4655.0 267.061 8.655e-14 ***
x3         1  568.5   568.5  32.617 9.600e-06 ***
Residuals 22  383.5    17.4

Now regress on x = x2 + x3 and produce the ANOVA table.

Coefficients:

            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.08324    1.59846  -0.678    0.505
x            1.26404    0.07491  16.874 1.90e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 4.269 on 23 degrees of freedom
Multiple R-Squared: 0.9253, Adjusted R-squared: 0.922
F-statistic: 284.7 on 1 and 23 DF, p-value: 1.896e-14

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 5188.0  5188.0  284.73 1.896e-14 ***
Residuals 23  419.1    18.2

The extra sum of squares accounted for by separate coefficients for x2 and x3 is 419.1 − 383.5 = 35.6. Therefore test

(35.6/1) / (383.5/22) = 2.043

against F(1, 22).

> 1-pf(35.6*22/383.5,1,22)
[1] 0.1670303

We have obtained a p-value of 0.167. Therefore we may regress on x2 + x3 provided normality assumptions are justified. ¤
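For reference (an aside, not from the original notes), the same test of a common coefficient for x2 and x3 can be obtained directly from nested models, assuming y, x2 and x3 hold the columns of Table 2.2.

full    <- lm(y ~ x2 + x3)      # separate coefficients for x2 and x3
reduced <- lm(y ~ I(x2 + x3))   # single coefficient for the combined count
anova(reduced, full)            # F about 2.04 on (1, 22) df, p-value about 0.167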

2.3 Choosing explanatory variables

Example 2.8 Insurance availability

This is a data set on insurance availability in Chicago census neighbourhoods in 1977-78. The data were collected as part of a study conducted by the U.S. Commission on Civil Rights (1979), the object of which was to determine whether factors such as race and family income affected insurance availability after adjustment for direct factors such as the risk of fire and theft. Variables are:

1. Percentage of minority races

2. Number of fires/1000 housing units

3. Number of thefts/1000 population

4. Percentage of housing units built before 1940

5. Net new homeowner policies/100 housing units

6. Median family income ($1000’s)

The objective is to find out which variables affect the issue of new policies.

We do this by regressing with variable 5 as the response.


Table 2.3 Insurance availability in Chicago

Race  Fire  Theft  Old   Policies  Income     Race  Fire  Theft  Old   Policies  Income
10.0   6.2    29   60.4     5.3    11.744     74.2  18.5    22   78.3     1.8     8.014
22.2   9.5    44   76.5     3.1     9.323     55.5  23.3    29   79.0     2.1     8.177
19.6  10.5    36   73.5     4.8     9.948     62.3  12.2    46   48.0     3.4     8.212
17.3   7.7    37   66.9     5.7    10.656      4.4   5.6    23   71.5     8.0    11.230
24.5   8.6    53   81.4     5.9     9.730     46.2  21.8     4   73.1     2.6     8.330
54.0  34.1    68   52.6     4.0     8.231     99.7  21.6    31   65.0     0.5     5.583
 4.9  11.0    75   42.6     7.9    21.480     73.5   9.0    39   75.4     2.7     8.564
 7.1   6.9    18   78.5     6.9    11.104     10.7   3.6    15   20.8     9.1    12.102
 5.3   7.3    31   90.1     7.6    10.694      1.5   5.0    32   61.8    11.6    11.876
21.5  15.1    25   89.8     3.1     9.631     48.8  28.6    27   78.1     4.0     9.742
43.1  29.1    34   82.7     1.3     7.995     98.9  17.4    32   68.6     1.7     7.520
 1.1   2.2    14   40.2    14.3    13.722     90.6  11.3    34   73.4     1.9     7.388
 1.0   5.7    11   27.9    12.1    16.250      1.4   3.4    17    2.0    12.9    13.842
 1.7   2.0    11    7.7    10.9    13.686     71.2  11.9    46   57.0     4.8    11.040
 1.6   2.5    22   63.8    10.7    12.405     94.1  10.5    42   55.9     6.6    10.332
 1.5   3.0    17   51.2    13.8    12.198     66.1  10.7    43   67.5     3.1    10.908
 1.8   5.4    27   85.1     8.9    11.600     36.4  10.8    34   58.0     7.8    11.156
 1.0   2.2     9   44.4    11.5    12.765      1.0   4.8    19   15.2    13.0    13.323
 2.5   7.2    29   84.2     8.5    11.084     42.5  10.4    25   40.8    10.2    12.960
13.4  15.1    30   89.8     5.2    10.510     35.1  15.6    28   57.8     7.5    11.260
59.8  16.5    40   72.7     2.7     9.784     47.4   7.0     3   11.4     7.7    10.080
94.4  18.4    32   72.9     1.2     7.342     34.0   7.1    23   49.2    11.6    11.428
86.2  36.2    41   63.1     0.8     6.565      3.1   4.9    27   46.6    10.9    13.731
50.2  39.7   147   83.0     5.2     7.459


Boxplots of the variables are given below.

Figure 2.11 Boxplots of the insurance variables (Race, Fire, Theft, Old, Policies, Income)

Some show a lot of skewness, which can be removed by taking

x2 = log(Fire), x3 = log(Theft)

and

x1 = log( (1 + Race)/(101 − Race) ),   x4 = log( Old/(100 − Old) ).

A word on x1 and x4. These are variables on a fixed scale, in this case 0–100. Where a transformation is appropriate, it should be of the form f(x/(100 − x)); this is called a folded transformation. The transformation which best removes the skewness in this case turns out to be

log( x/(100 − x) ).

Note that, if x can take values very close to 0 or 100, it is better to be on the safe side and take

log( (1 + x)/(101 − x) ).

Finally, y = Policies and x5 = Income.
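For concreteness, a hedged sketch of how these transformed variables might be constructed in R is given below; it assumes the raw columns have been read in with the names Race, Fire, Theft, Old, Policies and Income and attached (the notes do not show this step).

x1 <- log((1 + Race)/(101 - Race))   # folded log, safe for values near 0 or 100
x2 <- log(Fire)
x3 <- log(Theft)
x4 <- log(Old/(100 - Old))
x5 <- Income
y  <- Policies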


Scatterplots

Figure 2.12 Scatterplots of y against each of the transformed variables x1, . . . , x5

Notice the outlier in the last of the scatterplots. There is one income which is much larger than the others and it will, if left in the analysis, produce an influence far in excess of its entitlement. We shall return to the question of handling outliers but, in the meantime, we shall exclude that particular row in the data table.

Regression coefficients

Linear Regression Analysis
Response: Policies (y)

Variable          Coefficient    s.e.     t-value   p-value
Constant     α      −0.2574     3.2251    −0.080    0.9368
Race (x1)    β1     −0.4908     0.2142    −2.292    0.0273
Fires (x2)   β2     −1.1103     0.5758    −1.928    0.0610
Theft (x3)   β3      0.3583     0.4473     0.801    0.4279
Old (x4)     β4     −0.5874     0.2710    −2.168    0.0362
Income (x5)  β5      0.7502     0.2352     3.190    0.0028

R² = 0.851   d.f. = 40   s = 1.639   RSS = 107.443

Some variables are not statistically significant. This can be due to high correlations between them.

> cor(cbind(x1,x2,x3,x4,x5))

           x1         x2         x3         x4         x5
x1  1.0000000  0.7430068  0.3486494  0.2766339 -0.8142045
x2  0.7430068  1.0000000  0.4747265  0.5153427 -0.7978000
x3  0.3486494  0.4747265  1.0000000  0.4738686 -0.4044544
x4  0.2766339  0.5153427  0.4738686  1.0000000 -0.5310488
x5 -0.8142045 -0.7978000 -0.4044544 -0.5310488  1.0000000

You can see that some of these correlations are quite high, particularly between x1, x2 and x5: i.e. between Race, Fires and Income, which is what one might expect. ¤

2.3.1 Correlation

Why do high correlations matter? Suppose we look at a hypothetical regression where y is the vital capacity of a man's lungs in litres, x1 is his age in years and x2 is his height. Then we are considering the model

y = α+ β1x1 + β2x2.

Now suppose that we divide individuals in the sample into groups according to their height, such as [1.55, 1.60) metres, [1.60, 1.65) metres, etc.

Consider the case where the correlation coefficient between the explanatory variables is zero and the case where it is high and positive.


Figure 2.13 Plot of y (vital capacity) against x1 (age) for each height group (x2): in one panel the ages for the groups are about the same (correlation zero); in the other the ages go up with height (correlation coefficient positive)

To summarise:

• Restricting x2 in the uncorrelated case does not affect the spread of x1.
• Restricting x2 in the highly correlated case tends to restrict x1, so the spread is relatively small.
• Lines are parallel, so the common slope in the uncorrelated case is approximately equal to the regression coefficient of y on x1 at a fixed value of x2 (i.e. β1).
• In the correlated case, the separate slopes are poorly estimated because the spread of x1 values is relatively small, so β1 is poorly estimated.

What happens in the two cases if we ignore the x2 grouping and simply fit a simple regression of y on x1?

Figure 2.14 Plot of y against x1 ignoring height group (x2): regression of y on x1 ignoring x2 when the correlation is zero, and when the correlation coefficient is positive


With zero correlation, you can see that the slope of the fitted line is roughly the same as the three separate slopes in Figure 2.13. Ignoring the height groups when X1 and X2 are correlated, however, produces a very different effect. The approximate slope coefficient given by the common value of the separate slopes is negative, yet the total regression slope coefficient is positive.

Conclusion

High correlation coefficients (positive or negative) between explanatory variables lead to poorly estimated regression coefficients.

Definition 2.2: Explanatory variables with a sample correlation coefficient of zero are said to be orthogonal. ¤

How do we calculate p-values with correlated explanatory variables? Use an ANOVA table.

Example 2.9 (Example 2.8 revisited) Insurance availability

First arrange the explanatory variables in order of increasing p-value.

Linear Regression Analysis
Response: Policies (y)

Variable          Coefficient    s.e.     t-value   p-value
Constant     α      −0.2574     3.2251    −0.080    0.9368
Income (x5)  β5      0.7502     0.2352     3.190    0.0028
Race (x1)    β1     −0.4908     0.2142    −2.292    0.0273
Old (x4)     β4     −0.5874     0.2710    −2.168    0.0362
Fires (x2)   β2     −1.1103     0.5758    −1.928    0.0610
Theft (x3)   β3      0.3583     0.4473     0.801    0.4279

r² = 0.851   d.f. = 40   s = 1.639   RSS = 107.443

Now produce an ANOVA table with this ordering of explanatory variables.

ANOVA table

Analysis of Variance Table

Source of variation   df   SS         MS         Ratio      p-value
Income (x5)            1   570.3886   570.3886   212.350    0.0000
Race (x1)              1    14.5469    14.5469     5.4157   0.0251
Old (x4)               1    18.9172    18.9172     7.0427   0.0114
Fires (x2)             1     8.7814     8.7814     3.2692   0.0781
Theft (x3)             1     1.7230     1.7230     0.6415   0.4279
Residual              40   107.4430     2.6861
Total                 45   721.8001


We can condense this table as follows.

Analysis of Variance Table

Source of variation         df   SS         MS         Ratio     p-value
Regression on x5, x1, x4     3   603.8527   201.2842
Extra from x2, x3            2    10.5044     5.2522   1.9553    0.1548
Residual                    40   107.4430     2.6861
Total                       45   721.8001

The p-value is obtained from testing 1.9553 as F(2, 40). Now re-run the regression with Theft and Fires left out.

Linear Regression Analysis
Response: Policies (y)

Variable          Coefficient    s.e.     t-value   p-value
Constant     α      −3.1998     2.2973    −1.393    0.1710
Income (x5)  β5      0.8982     0.2275     3.948    0.0003
Race (x1)    β1     −0.6099     0.2046    −2.981    0.0048
Old (x4)     β4     −0.6546     0.2522    −2.595    0.0130

r² = 0.837   d.f. = 42   s = 1.676   RSS = 117.947

This seems to be a model which fits nicely, but the analysis is not complete without checking assumptions.

Look at a plot of residuals against fitted values and a normal probability plot of residuals.

Figure 2.15 Plot of Residuals against Fitted values and normal probability plot of Residuals


Conclusions

• Insurance is available to the higher income groups.
• The older the property, the less the availability of insurance.
• It is easier to get insurance if one is a white Caucasian.

These data were collected by the U.S. Commission on Civil Rights (1979) and published in a report, Insurance Redlining: Fact not Fiction.

The report was the result of an examination of charges brought by several community organisations that insurance companies were redlining their neighbourhoods (cancelling policies, refusing to insure or renew, etc). ¤

2.4 Binary explanatory variables

In Part A, Example 4.5, you saw a data set on ancient Etruscan and modern Italian skull widths. The data are reproduced below.

Table 2.4 Ancient Etruscan and modern Italian skull widths

Ancient Etruscan skulls              Modern Italian skulls
141 147 126 140 141 150 142          133 124 129 139 144 140
148 148 140 146 149 132 137          138 132 125 132 137 130
132 144 144 142 148 142 134          130 132 136 130 140 137
138 150 142 137 135 142 144          138 125 131 132 136 134
154 149 141 148 148 143 146          134 139 132 128 135 130
142 145 140 154 152 153 147          127 127 127 139 126 148
150 149 145 137 143 149 140          128 133 129 135 139 135
146 158 135 139 144 146 142          138 136 132 133 131 138
155 143 147 143 141 149 140          136 121 116 128 133 135
158 141 146 140 143 138 137          131 131 134 130 138 138
150 144 141 131 147 142 152          126 125 125 130 133
140 144 136 143 146 149 145          120 130 128 143 137

In the example we carried out a t-test for the difference between means; the test statistic was 11.92 which was tested against a t(152) distribution. But we could also take a regression approach to this problem. Suppose we combine the groups and call Yi the response for the ith individual, introducing an explanatory variable xi such that xi = 1 for one group and xi = 0 for the other. We then fit the linear model

Yi = α+ βxi + εi, i = 1, . . . , n.


Clearly, for one group E(Y | x) = α + β and for the other E(Y | x) = α. The difference between means is β, and we can use the regression output to test the null hypothesis H0 : β = 0.

Example 2.10 Etruscan and Italian skull widths

The data, which are in the file eskulls.txt, are read in using the "fill" qualifier, which allows for the groups being of different size by making up the smaller one with missing values. There are 84 measurements on Etruscans and 70 on modern Italians, so data vectors are extracted, the missing values are discarded and the data are then re-arranged into a single column of Y values with the groups being identified by allocating x = 1 to ancient Etruscans and x = 0 to modern Italians.

> skulls <- read.table("c:/Datasets/eskulls.txt", header = 1, fill = TRUE)
> attach(skulls)
> Italian <- Italian[1:70]
> y <- c(Etruscan,Italian)
> x <- c((1:84)*0+1,(1:70)*0)
> skull <- lm(y~x)
> summary(skull)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-17.7738  -3.6911  -0.4429   4.2262  15.5571

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 132.4429     0.7018  188.73   <2e-16 ***
x            11.3310     0.9502   11.93   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.871 on 152 degrees of freedom
Multiple R-Squared: 0.4833, Adjusted R-squared: 0.4799
F-statistic: 142.2 on 1 and 152 DF, p-value: < 2.2e-16

You can see that the t-statistic obtained is exactly the same as the one we found before. If you like you can check this by entering the following.

> t.test(Etruscan,Italian,var.equal=TRUE)

¤

It would not, of course, be worth going to all of the trouble of taking a regression approach for something as straightforward as a two-sample t-test but, as you will see shortly, there is much more to using binary explanatory variables: the idea can be extended to using both binary and continuous variables in a multiple regression and to comparing more than two samples.


2.4.1 Multiple regression with binary explanatory variables

Suppose that we return to our earlier example where Y refers to vital capacity of the lungs, and suppose that we would like to compare it between the sexes. Vital capacity is strongly related to height and men tend to be taller than women, so any apparent difference between the sexes could be partly, or even entirely, due to difference in height. What is needed is the difference in mean vital capacity between men and women, adjusted for height. We can do this by including both X1 (height) and X2 (gender) in the regression model, the latter as a binary variable (X2 = 1 for men, X2 = 0 for women).

The results of such an analysis are shown in Figure 2.16.

Figure 2.16 Adjusting a difference between groups for X1: fitted lines for females (y = α + β1 x1) and males (y = α + β2 + β1 x1), showing the unadjusted and adjusted differences in mean vital capacity

You can see that there is a marked difference between the difference ignoring height and the difference adjusted for height (β2). The p-value for testing the hypothesis that the coefficient β2 = 0, given by the regression output, would be the p-value for testing the hypothesis of no difference between the sexes.

Example 2.11 (Example 2.1 revisited) Abrasion loss in rubber

We can use the abrasion loss data to show how the method works in practice. This time we shall divide the data into two groups of 15; the rubber pieces with low tensile strength (< 180 kgf/cm²) and those with high tensile strength (≥ 180 kgf/cm²). Figure 2.17 shows a scatterplot with the two groups identified.


Figure 2.17 Loss against Hardness: ◦ = low strength, • = high strength

What is the difference in mean abrasion loss between high and low tensile strength rubber, adjusted for hardness?
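The notes do not show how the binary variable High was constructed; one plausible way (a hedged sketch, assuming the rubber data are attached as before) is simply

High <- Strength >= 180    # logical indicator: TRUE for the high tensile strength group

which R then treats as a 0/1 explanatory variable in the regression below.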

> summary(lm(Loss ~ Hardness + High))
Call:
lm(formula = Loss ~ Hardness + High)

Residuals:
    Min      1Q  Median      3Q     Max
-57.017  -8.638  -1.724  16.398  62.174

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  544.3326    32.4518  16.774 8.36e-16 ***
Hardness      -5.9864     0.4604 -13.002 3.88e-13 ***
HighTRUE    -103.4842    11.0243  -9.387 5.42e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29.85 on 27 degrees of freedom
Multiple R-Squared: 0.8931, Adjusted R-squared: 0.8852
F-statistic: 112.8 on 2 and 27 DF, p-value: 7.788e-14

The estimated difference in abrasion loss between the high and low tensile groups, adjusted for hardness, is −103.48, and the difference is highly significant. A 95% confidence interval for the difference is

−103.48 ± 2.0518 × 11.02,

where 2.0518 is the 0.975 quantile of t(27): this gives the interval (−126.09, −80.87). ¤
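The same interval can be read off with confint (an aside, not from the notes; the object name rubbin is just illustrative).

rubbin <- lm(Loss ~ Hardness + High)
confint(rubbin, "HighTRUE")    # 95% interval, roughly (-126.1, -80.9)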


2.4.2 Product terms in the model

You have already seen examples of non-linearity in the variables (e.g. involving a quadratic term in x), but there is a further kind of non-linearity in the variables. Suppose we have two explanatory variables, x1 and x2, and we fit

y = α+ β1x1 + β2x2 + γx1x2.

The main point of doing this is to test whether γ = 0. This is called testing for additivity because, if γ = 0, then the model is

y = α+ β1x1 + β2x2

in which the contributions of the explanatory variables are added together.

If γ ≠ 0, so that the model is not additive, then the effect of x1 on Y depends on the value of x2 and the effect of x2 on Y depends on x1: the two explanatory variables are said to interact in their effects on Y.

When one of the explanatory variables is binary, the test for γ = 0 is a test of whether the regression of Y on the continuous variable has the same slope for the two groups. Including an interaction term in Example 2.11 to give the model

E(Y | x1, x2) = α + β1 x1 + β2 x2 + γ x1 x2

effectively results in fitting

E(Y | x1) = α + β1 x1

for low-tensile rubber and

E(Y | x1) = α + β2 + (β1 + γ) x1

for high-tensile rubber.

Example 2.12 (Example 2.1 revisited) Abrasion loss in rubber

The coefficients from fitting an interaction model are given below, having been obtained by entering

> summary(lm(Loss ~ Hardness + High + Hardness*High))

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    658.0548    42.8922  15.342 1.52e-14 ***
Hardness        -6.1284     0.5852 -10.473 8.03e-11 ***
High          -130.9288    68.8529  -1.902   0.0684 .
Hardness:High    0.3934     0.9738   0.404   0.6895

The estimated coefficient of the interaction term is not significantly different from zero, and we conclude that the slopes are the same for the two groups. ¤


2.4.3 Comparing several samples

Let us begin with an example.

Example 2.13 Diabetic mice

The following table gives the BSA nitrogen bound of three groups of diabetic mice [from Dolkart et al. (1971) Diabetes, 20, pages 162-167].

The first group are normal, untreated diabetics, the second have received alloxan treatment and the third have received alloxan plus insulin. We wish to know whether the treatments have different effects on the BSA nitrogen bounds.

Table 2.5 BSA nitrogen bounds of three groups of diabetic mice

Normals   Alloxan   Alloxan + insulin
  156       391            82
  282        46           100
  197       469            98
  297        86           150
  116       174           243
  127       133            68
  119        13           228
   29       499           131
  253       168            73
  122        62            18
  349       127            20
  110       276           100
  143       176            72
   64       146           133
   26       108           465
   86       276            40
  122        50            46
  455        73            34
  655                      44
   14

Before we start any kind of formal analysis, we should look at the data.

Figure 2.18 Boxplots of the diabetic mice data (Normal, Alloxan, Alloxan + insulin groups)


The boxplots show some skewness and some outliers to the right. The positive skewness is confirmed by calculating summary statistics for the three groups.

                    Size   Mean    s.d.    Skew
Normal               20    186.1   158.8   1.50
Alloxan              18    181.8   144.8   1.04
Alloxan + insulin    19    112.9   105.8   2.12

Furthermore, the data are not normally distributed.

Figure 2.19 Normal QQ plot

Points were plotted after subtracting respective group means.

A transformation is called for in order both to remove the skewness and give plausibility to a normal model.

Right skewness is often removed by taking logs, so it is worth trying. Summary statistics for the logs of the BSA nitrogen bounds are given below.

                    Size   Mean    s.d.     Skew
Normal               20    4.859   0.963   −0.607
Alloxan              18    4.867   0.922   −0.622
Alloxan + insulin    19    4.397   0.834    0.060

Much of the skewness has been removed and the standard deviations are similar.

The normal probability plot is also more encouraging.


Figure 2.20 Normal QQ plot for transformed data

Again, points were plotted after subtracting respective group means.

We now set it up to be investigated via regression, so that

E (Y | x2, x3) = α+ β2x2 + β3x3

where

x2 = 1 if alloxan only, 0 otherwise;    x3 = 1 if alloxan + insulin, 0 otherwise.

Then α gives the normal level and β2, β3 give contrasts with it. Thus we can test whether or not the Alloxan and the Alloxan + insulin groups differ from the Normal group by testing for the regression coefficients being zero. In general, we should be able to set ourselves up to handle p − 1 binary variables for p treatments as follows:

            Binary variables
Treatment   x2   x3   x4   · · ·   xp
1            0    0    0   · · ·    0
2            1    0    0   · · ·    0
3            0    1    0   · · ·    0
⋮            ⋮    ⋮    ⋮    ⋱        ⋮
p            0    0    0   · · ·    1

Thus we have the regression model

E(Y | x2, . . . , xp) = α + β2 x2 + · · · + βp xp,

where α is the expected response for a unit receiving treatment 1 and the βi, i = 2, . . . , p, are contrasts of the other treatments with treatment 1.
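As an aside (not from the notes): in R the same dummy coding can be produced automatically by declaring the treatment as a factor, with the first level acting as the baseline treatment. A hedged sketch for the diabetic mice, assuming a response vector BSA holding the 57 values in group order:

group <- factor(rep(c("Normal", "Alloxan", "Alloxan+insulin"), times = c(20, 18, 19)),
                levels = c("Normal", "Alloxan", "Alloxan+insulin"))
summary(lm(log(BSA) ~ group))   # intercept = Normal level; other coefficients = contrasts with it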

Carrying out the regression, we get the following.


Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.85940    0.20323  23.911   <2e-16 ***
alloxan          0.00749    0.29529   0.025    0.980
alloxan.insulin -0.46238    0.29117  -1.588    0.118

Note that we cannot conclude from this that there is insufficient evidence of a difference between treatments. From this we can only conclude that there is no evidence of the alloxan treatment group and the alloxan + insulin treatment group being different from the normal group. To test for no difference between the three groups we need the appropriate analysis of variance table. The ANOVA output from the regression is given below.

Analysis of Variance Table

Response: log(BSA)
                Df Sum Sq Mean Sq F value Pr(>F)
alloxan          1  0.667   0.667  0.8077 0.3728
alloxan.insulin  1  2.083   2.083  2.5217 0.1181
Residuals       54 44.607   0.826

This can be turned into an ANOVA table to test the three groups having equal means by adding the sums of squares due to the regression. Thus

Analysis of Variance Table

Source of variation      df    SS       MS       Ratio   p-value
Regression on x2, x3      2     2.750   1.3750   1.665   0.199
Residual                 54    44.607   0.826
Total                    56    47.357

The ratio 1.665 is tested as F(2, 54) to obtain the p-value 0.199. We conclude that there is insufficient evidence of a difference between treatments. ¤

The experiment in Example 2.13 is known in the jargon as a completely randomised design, and the analysis is called a one-way analysis of variance. This is because, within each treatment group, the mice are randomly allocated; that is, there is nothing special about the order of the measurements in each column. Contrast this with another data set on the growth of pigs from four litters on three separate diets. Pigs from the same litter are likely to be closer in growth rate than pigs from different litters, so it clearly makes no sense to allocate randomly to diets. In practice, four litters, each of 3 piglets, were used with each diet being allocated at random to one piglet in each litter. In the jargon, the litters are said to form blocks and, because the three diets are randomly allocated within each block, the experiment is referred to as a randomised block design.


The data are given below.

                Blocks (Litters)                Treatment
Treatments      I       II      III     IV      totals    means
Diet A          89      78      114     79      360       90.00
Diet B          68      59       85     61      273       68.25
Diet C          62      61       83     82      288       72.00
Block totals   219     198      282    222      921
Block means    73.00   66.00    94.00  74.00              76.75

The table shows the weight increase (in 100 gm units) of each piglet after 6 months of controlled diet. The question at issue is whether there is a significant difference in the effect of the different diets.

Looking at the treatment totals, diet A seems to be the winner. Let us, however, try a one-way analysis of variance (i.e. let us, for the time being, ignore the fact that there may be differences between litters).

Example 2.14 Weight gain of piglets

The data are given in the file piglets.txt. The first column of the file comprises the weight gains in units of 100 gm. The second column has 1 for diet B and zero otherwise, the third column has 1 for diet C and zero otherwise; the fourth column has 1 for litter II and zero otherwise and the fifth and sixth columns are constructed in the same way for litters III and IV respectively. Thus we can carry out our one-way analysis of variance by regressing the weight gain on columns 2 and 3.

Analysis of Variance Table

Response: Gain
          Df  Sum Sq Mean Sq F value  Pr(>F)
B          1  433.50  433.50  2.2913 0.16440
C          1  648.00  648.00  3.4250 0.09724
Residuals  9 1702.75  189.19

This can be condensed into the table below.

              df    SS        MS       F
Treatments     2    1081.50   540.75   2.86
Residual       9    1702.75   189.19
Total         11    2784.25

Testing against F(2, 9), the p-value is 0.109. The hypothesis of no treatment differences is not rejected.

This analysis is wrong because the block differences have not been taken into account. What effect would you expect this to have?

The answer is that the residual sum of squares has been inflated by ignoring the block differences, thereby causing the F-ratio to be less than its critical value. A close look at the table of weight gains shows that the gains in Litter III are each larger than their counterparts in other litters. The RSS contains this heterogeneity in addition to the inherent variation.

The model must take account of block differences. Therefore we should fit the model

Yij = µ + ti + bj + εij,   i = 1, 2, 3;  j = 1, . . . , 4
(taking t1 = b1 = 0, so the free parameters are the treatment contrasts ti, i = 2, 3, and the block contrasts bj, j = 2, 3, 4),

and produce a two-way analysis of variance table.

> anova(lm(Gain ~ B + C + II + III + IV))
Response: Gain
          Df  Sum Sq Mean Sq F value  Pr(>F)
B          1  433.50  433.50  6.5270 0.04321 *
C          1  648.00  648.00  9.7566 0.02049 *
II         1  462.25  462.25  6.9598 0.03864 *
III        1  840.50  840.50 12.6550 0.01196 *
IV         1    1.50    1.50  0.0226 0.88547
Residuals  6  398.50   66.42

Again, we condense the table.

Source of variation     d.f.   S.S.      M.S.     F
Treatments (Diets)        2    1081.50   540.75   8.14
Blocks (Litters)          3    1304.25   434.75   6.55
Residual                  6     398.50    66.42
Total                    11    2784.25

The null hypothesis that the block means are equal is tested as F(3, 6): the value 6.55 of the F-statistic gives a p-value of 0.025, so it is rejected at the 2.5% level. Furthermore, we can now use 8.14, tested as F(2, 6), to test for equal treatment means. The p-value is 0.020, so the null hypothesis of equal treatment means is rejected at the 2% level.

Recall that when tested (incorrectly) using one-way ANOVA, the null hypothesis of equal treatment means was not rejected.

We see that isolating the variation due to the litters has substantially reduced the RSS. This underlines the importance of designing the experiment so that systematic variation does not become muddled with treatment effects but BEWARE! It is important not to create blocks unless there are grounds for suspecting real heterogeneity.

Note that incorporating blocks reduced the number of degrees of freedom of the residual from 9 to 6. This will have the effect of increasing the critical value of the F-statistic. Therefore the reduction in residual sum of squares by isolating the block variation must be more than sufficient to compensate for the reduction in the residual degrees of freedom.


So which diet is best? Which produces the largest weight gain? We need to look at the regression output.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   86.250      5.763  14.967  5.6e-06 ***
B            -21.750      5.763  -3.774  0.00924 **
C            -18.000      5.763  -3.124  0.02049 *
II            -7.000      6.654  -1.052  0.33332
III           21.000      6.654   3.156  0.01967 *
IV             1.000      6.654   0.150  0.88547

Diet A produces a significantly larger weight gain than Diet B (p-value 0.009) and a significantly larger weight gain than Diet C (p-value 0.02). Diet C is better than Diet B, but is the difference significant?

If t̂B and t̂C are estimates of the contrasts of Diets B and C respectively with Diet A, then the estimated contrast of Diet C with Diet B is t̂C − t̂B. Since t̂C ∼ N(tC, σC²) and t̂B ∼ N(tB, σB²), then

t̂C − t̂B ∼ N( tC − tB, σC² + σB² − 2C(t̂C, t̂B) )

or, in the usual notation,

t̂C − t̂B ∼ N( tC − tB, σ²[ (X^T X)^{-1}_{22} + (X^T X)^{-1}_{33} − 2(X^T X)^{-1}_{23} ] ).

From the package, (X^T X)^{-1}_{22} = (X^T X)^{-1}_{33} = 0.375 and (X^T X)^{-1}_{23} = 0.125,

so

t̂C − t̂B ∼ N(tC − tB, 0.5σ²)

and therefore

(t̂C − t̂B)/√(0.5σ̂²) ∼ t(6).

(t̂C − t̂B)/√(0.5σ̂²) = (−18.00 + 21.75)/√(0.5 × 66.42) = 0.6507

is tested against t(6) for a p-value of 0.5393.
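The same computation can be done from the fitted model's covariance matrix (an aside, not from the notes; pigfit is just an illustrative name for the fit of Gain on B, C, II, III, IV above).

pigfit <- lm(Gain ~ B + C + II + III + IV)
est <- coef(pigfit)["C"] - coef(pigfit)["B"]                               # 3.75
se  <- sqrt(vcov(pigfit)["C","C"] + vcov(pigfit)["B","B"] - 2*vcov(pigfit)["B","C"])
2 * pt(abs(est/se), df.residual(pigfit), lower.tail = FALSE)               # p about 0.54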

Of course, the lazy way to do this is just to create a vector of indicators for Diet A (i.e. a binary explanatory variable for Diet A) and regress on Diet A and Diet C, along with the block variables. Then the C coefficient is the contrast of Diet C with Diet B and the p-value is as given.


Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   64.500      5.763  11.193 3.04e-05 ***
A             21.750      5.763   3.774  0.00924 **
C              3.750      5.763   0.651  0.53932
II            -7.000      6.654  -1.052  0.33332
III           21.000      6.654   3.156  0.01967 *
IV             1.000      6.654   0.150  0.88547

¤

The design you have just seen in Example 2.14 is usually referred to as a balanced design. A completely randomised design is balanced when the numbers of cases in each treatment are equal: thus the design in Example 2.13 was unbalanced because there were different numbers of mice in each treatment group. Note that the regression technique we used did not require the design to be balanced.

For the randomised block design you saw in Example 2.14, the number of units in each block had been chosen to equal the number of treatments. Because each treatment was used the same number of times within each block (once in this example), the experiment is said to be balanced within blocks. Does a randomised block design need to be balanced for the regression technique to work?

Example 2.15 Testosterone levels of chronic schizophrenics

The data in the table are testosterone levels (in mg/100 ml) of eight chronic schizophrenic patients during three treatment periods (Beumont et al. (1974)). Four subjects did not provide data during the chlorpromazine (CPZ) treatment period, so the data array is unbalanced.

                  Treatment
Patient   Chronic   Placebo   CPZ
H.M.        381       682      −
R.H.        142       393      −
L.M.        190       358      −
H.H.        173       282      −
O.W.        139       253     293
A.W.        599       641     681
E.W.        607       528     676
M.P.        421       467     511

It is the column effect (treatment in this case) which is of interest.

A two-way ANOVA can be carried out by constructing the following layout, which you will find in the file testos.txt.


                              Patient                           Treatment
Testosterone  R.H.  L.M.  H.H.  O.W.  A.W.  E.W.  M.P.      Placebo   CPZ
     381        0     0     0     0     0     0     0          0       0
     142        1     0     0     0     0     0     0          0       0
     190        0     1     0     0     0     0     0          0       0
     173        0     0     1     0     0     0     0          0       0
     139        0     0     0     1     0     0     0          0       0
     599        0     0     0     0     1     0     0          0       0
     607        0     0     0     0     0     1     0          0       0
     421        0     0     0     0     0     0     1          0       0
     682        0     0     0     0     0     0     0          1       0
     393        1     0     0     0     0     0     0          1       0
     358        0     1     0     0     0     0     0          1       0
     282        0     0     1     0     0     0     0          1       0
     253        0     0     0     1     0     0     0          1       0
     641        0     0     0     0     1     0     0          1       0
     528        0     0     0     0     0     1     0          1       0
     467        0     0     0     0     0     0     1          1       0
     293        0     0     0     1     0     0     0          0       1
     681        0     0     0     0     1     0     0          0       1
     676        0     0     0     0     0     1     0          0       1
     511        0     0     0     0     0     0     1          0       1

First the ANOVA table for the linear regression model is obtained

Analysis of Variance Table

Response: Testosterone
          Df Sum Sq Mean Sq F value    Pr(>F)
R.H.       1  52258   52258  9.8927 0.0104165
L.M.       1  60434   60434 11.4404 0.0069748
H.H.       1 121836  121836 23.0640 0.0007209
O.W.       1 264148  264148 50.0042  3.41e-05
A.W.       1  24611   24611  4.6590 0.0562539
E.W.       1  23213   23213  4.3943 0.0624685
M.P.       1   5096    5096  0.9647 0.3491714
Placebo    1  28522   28522  5.3994 0.0425076
CPZ        1  46659   46659  8.8327 0.0139982
Residuals 10  52825    5283

This may be condensed into the two-way ANOVA format.

Source of variation     d.f.   S.S.     M.S.       F
Treatments                2     75181   37590.5     7.12
Blocks (Patients)         7    551596   78799.4    14.92
Residual                 10     52825    5283
Total                    19    679602


Test 7.12 against F(2, 10) to obtain a p-value of 0.012, and we reject the null hypothesis of no treatment differences. You could go on to produce the regression coefficients and thereby determine p-values for individual treatment differences. ¤

2.5 Weighted least squares

We have seen how to test the assumptions of the linear model, but what do we do if one (or more) of the assumptions does not hold?

In particular, suppose the equal variance assumption does not hold.

2.5.1 Simple case - zero intercept, one explanatory variable

For zero intercept, we have the model

Yi = βxi + εi, i = 1, 2, . . . , n,

where εi ∼ N(0, σi²) and σ1², σ2², . . . , σn² are not all equal.

Suppose we transform the equation so that the assumptions of the linear model are satisfied. Divide each equation by σi. Then

Yi* = β xi* + εi*,   i = 1, 2, . . . , n,

where Yi* = Yi/σi, xi* = xi/σi, εi* = εi/σi.

The variance of εi* is now

V(εi*) = σi²/σi² = 1.

The assumptions are now satisfied for the ‘starred’ model and, therefore

β̂* = ∑ xi* Yi* / ∑ (xi*)² = ∑ xi Yi/σi² / ∑ xi²/σi².

The estimator β̂* is called the weighted least-squares estimator.

Example 2.16

Samples are taken of lengths of nylon thread and the number of flaws in each sample is noted. The results are

Sample number (i)          1   2   3   4   5   Total
Length of sample (xi m)    3   6   2   5   4   20
Number of flaws (yi)       2   4   0   5   4   15

The problem is to estimate the average number of flaws per metre. Obviously the common-sense answer is 15/20 = 0.75.


Suppose we decide to fit the linear model

Yi = βxi + εi, i = 1, 2, . . . , 5,

The ordinary least squares estimate of β is

β̂ = ∑ xi Yi / ∑ xi² = 71/90 = 0.789.

Now this would be fine if the assumptions of the linear model were reasonable, but are they?

It hardly seems likely that the constant variance assumption holds. The variance of the number of flaws in a thread is likely to increase as the length of the thread increases. How, though, will it increase?

Variables such as the number of flaws in a thread may well have a Poisson distribution. If this is the case, the variance is equal to the mean, i.e.

V (εi) = E(Yi) = βxi,

so V(εi) is proportional to xi. To obtain equal variances, then, we need to divide through by √xi. Then

Yi* = β xi* + εi*,   i = 1, 2, . . . , 5,

where Yi* = Yi/√xi, xi* = √xi, εi* = εi/√xi.

The variance of the departure is now

V(εi*) = V(εi)/xi = β,

which is the same for all i = 1, 2, . . . , 5.

The weighted least-squares estimator is

β̂* = ∑ xi* Yi* / ∑ (xi*)² = ∑ (√xi)(Yi/√xi) / ∑ (√xi)² = ∑ Yi / ∑ xi.

This gives the estimate β̂* = 15/20 = 0.75, which is the same as the common-sense estimate. ¤

Note: this is the estimate you would get using the method of maximum likelihood with a product of Poisson likelihoods.
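In R the same estimates can be reproduced with lm and a weights argument (a minimal sketch, typing in the data from the table above; not part of the original notes).

x <- c(3, 6, 2, 5, 4)
y <- c(2, 4, 0, 5, 4)
coef(lm(y ~ x - 1))                  # ordinary least squares through the origin: 0.789
coef(lm(y ~ x - 1, weights = 1/x))   # weighted least squares with V(eps_i) proportional to x_i: 0.75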


2.5.2 General case of weighted least-squares

Consider the linear model

Y = Xθ + ε

where X is an n × p matrix of known constants and θ is a vector of p unknown parameters.

Suppose that L is a linear transformation such that

ε* = Lε   and   V(ε*) = L V(ε) L^T = σ²I

where I is the identity matrix.

Then we may write

Y* = X*θ + ε*

where Y* = LY and X* = LX.

The least-squares estimators of the parameters are given by

X^T L^T L X θ̂ = X^T L^T L Y

or, if X^T L^T L X is non-singular,

θ̂ = (X^T L^T L X)^{-1} X^T L^T L Y.
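As a closing illustration (an aside, not from the notes), the general weighted least-squares estimator can be computed directly in R; here X is the design matrix, y the response and W = L^T L the weight matrix (for independent errors, W = diag(1/σi²)), all assumed to be supplied by the user.

wls <- function(X, y, W) {
  # solve (X' W X) theta-hat = X' W y for the weighted least-squares estimate
  solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
}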

