
Statistical Modeling I

Luca Rossini

School of Mathematical Sciences, QMUL

https://lucarossini.wixsite.com/luca-rossini

Last Update: February 4, 2021

Contents

1 Statistical Models and Modelling
2 Simple Linear Regression
  2.1 The Model
  2.2 Least Squares Estimation
  2.3 Properties of the Estimators
  2.4 Assessing the Model
    2.4.1 Analysis of Variance Table
    2.4.2 F test
    2.4.3 Estimating σ²
    2.4.4 Coefficient of Determination
  2.5 Fitted Values and Residuals
    2.5.1 Properties of Residuals
    2.5.2 Standardised residuals
    2.5.3 Residual plots
    2.5.4 Settlement Payment example using R
  2.6 Inference about the regression parameters
    2.6.1 Inference about β1
    2.6.2 Inference about β0
    2.6.3 Inference about E(Yi|X = xi)
    2.6.4 Prediction Interval for a new observation
  2.7 Further Model Checking
    2.7.1 Outliers and influential observations
    2.7.2 Example - Gesell scores
    2.7.3 Transformation of the response
    2.7.4 Example - Plasma level of polyamine
    2.7.5 Pure Error and Lack of Fit
  2.8 Matrix approach to simple linear regression
    2.8.1 Vectors of random variables
    2.8.2 Multivariate Normal Distribution
    2.8.3 Least squares estimation
  2.9 Matrix approach to simple linear regression
    2.9.1 Finding maximum likelihood estimates
    2.9.2 Properties of maximum likelihood estimators
    2.9.3 Relationship with least squares estimates
  2.10 Exercise
3 Multiple Regression
  3.1 Example - Dwine data
  3.2 Multiple Linear Regression Model
  3.3 Least squares estimation
  3.4 Analysis of Variance
    3.4.1 F-test for the Overall Significance of Regression
  3.5 Inferences about the parameters
  3.6 Confidence interval for µ
  3.7 Predicting a new observation
  3.8 Example on sales data
  3.9 Model Building
    3.9.1 F-test for the deletion of a subset of variables
    3.9.2 All subsets regression
    3.9.3 Automatic methods for selecting a regression model
  3.10 Problems with fitting regression models
    3.10.1 Near-singular X^T X
    3.10.2 Variance inflation factor
  3.11 Model checking
    3.11.1 Standardised residuals
    3.11.2 Leverage and influence
    3.11.3 Cook's distance
  3.12 What is a linear model?
  3.13 Polynomial regression
    3.13.1 Example: Crop Yield
4 Theory of Linear Models
  4.1 The Gauss-Markov Theorem
  4.2 Sampling distribution of MSE (S²)

1 Statistical Models and Modelling

What do we mean by Statistical Modeling and a Statistical Model?

Think back to the first year. You looked at one-sample problems. There were statements like: Y1, Y2, . . . , Yn are independent and identically distributed normal random variables with mean θ and variance σ². Another way of writing this is as

Yi = θ + εi i = 1, 2, . . . , n

where εi ∼ N(0, σ²) and are independent. We wanted to estimate θ, which we suggested could be done by using Ȳ, or to test a hypothesis such as H0 : θ = θ0.

In Probability and Statistics II we also looked at two sample problems, for example that

Y11, Y12, . . . , Y1n ∼ N(θ1, σ²)

and, independently,

Y21, Y22, . . . , Y2n ∼ N(θ2, σ²)

and were interested in the hypothesis H0 : θ1 = θ2. We can write this as a model

Yij = θi + εij i = 1, 2 j = 1, 2, . . . , n

where εij ∼ N(0, σ2) and are independent.

These statistical models have two components: a part which tells us about the average behaviour of Y and a random part.

In this course, Statistical Modeling I, we are interested in models where we have quantitative measurements of a response variable or dependent variable Y and possible quantitative explanatory (or regressor or independent) variables X1, X2, . . . , Xp−1.

In this chapter we are going to discuss some of the ideas of Statistical Modeling.

1. We start with a real life problem for which we have some data.

2. We think of a statistical model as a mathematical representation of the variables we have measured. This model usually involves some parameters.


3. We may then try to estimate the values of these parameters.

4. We may wish to test hypotheses about the parameters.

5. We may wish to use the model to predict what would happen in the future in a similar situation.

In order to test hypotheses or make predictions we usually have to make some assumptions. Part of the modeling process is to test these assumptions. Having found an adequate model we must compare its predictions, etc. with reality to check that it does seem to give reasonable answers.

We can illustrate these ideas using a simple example.

Example 1.1. Suppose that we are interested in some items, widgets say, which are manufactured in batches. The size of the batch and the time to make the batch in hours are recorded in Table 1.1.

x (batch size)   y (hours)
30               73
20               50
60               128
80               170
40               87
50               108
60               135
30               69
70               148
60               132

Table 1.1: Data on batch size and time to make each batch

We begin by plotting the data to see what sort of relationship might hold.

> x <- c(30, 20, 60, 80, 40, 50, 60, 30, 70, 60)
> y <- c(73, 50, 128, 170, 87, 108, 135, 69, 148, 132)

> plot(x,y, main = "Time to make (y) vs size of batch (x)")

From Figure 1.1, it seems that a straight line relationship is a good representation of the data although it is not an exact relationship. Using R we can fit this model.

> reg <- lm(y ~ x)
> summary(reg)

Call:
lm(formula = y ~ x)


Figure 1.1: Scatterplot of time versus batch size.

Residuals:
   Min     1Q Median     3Q    Max
  -3.0   -2.0   -0.5    1.5    5.0

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.00000    2.50294   3.995  0.00398 **
x            2.00000    0.04697  42.583 1.02e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.739 on 8 degrees of freedom
Multiple R-squared: 0.9956,    Adjusted R-squared: 0.9951
F-statistic: 1813 on 1 and 8 DF,  p-value: 1.02e-10

The output is as follows: the estimate of the intercept is 10.0 and the slope 2.0. The fitted line is

Y = 10 + 2x.

One interpretation of this is that on average it takes 10 hours to set up the machinery to make widgets and then it takes 2 hours to make each widget. We can add this line to the plot and obtain the fitted line plot, see Figure 1.2.

> abline(10, 2, lty=3)

But before we come to this conclusion we should check that our data satisfy the assumptions of the statistical model. One way to do this is to look at residual plots, as in Figure 1.3. We shall discuss these later in the course and in the practicals, but here there is no reason to doubt our model.
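For completeness, the residual plot in Figure 1.3 can be reproduced from the reg object fitted above; a minimal sketch (the plotting options used for the published figure may differ):

> plot(x, reg$residuals, main = "Residuals versus x")   # crude residuals against the regressor
> abline(h = 0, lty = 2)                                # reference line at zero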


Figure 1.2: Fitted line plot of time versus batch size.

Statistical modelling is iterative.

1. We think of a model we believe will fit the data.

2. We fit it and then check the model.

3. If it is correct, we use the model to explain what is happening or predict what will happen.

Note that we should be very wary of making predictions far outside of the x values which are used to fit the model.

In general different techniques are needed depending on whether the variables are:

• qualitative,

• quantitative continuous,

• quantitative discrete.

In Statistical Modeling I we will mostly study continuous Y with quantitative X1, X2, . . . , Xp.

In Statistical Modeling II you would study continuous Y with qualitative X1, X2, . . . , Xp.

SMI and SMII use Linear Models.

In Time Series we relax the assumption that errors are independent or uncorrelated.


Figure 1.3: Residual plot (reg$residuals versus x).

2 Simple Linear Regression

Some notation

Suppose we have values x1, x2, . . . , xn and y1, y2, . . . , yn then we define the sample mean as

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.

Note all the summations below are over the range 1 to n.

We define the deviance as the sum of squared differences between the observations xi and their sample mean x̄, that is

S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
       = \left(\sum_{i=1}^{n} x_i^2\right) + \sum_{i=1}^{n}\left(\bar{x}^2 - 2x_i\bar{x}\right)
       = \left(\sum_{i=1}^{n} x_i^2\right) + n\bar{x}^2 - 2\bar{x}\sum_{i=1}^{n} x_i
       = \left(\sum_{i=1}^{n} x_i^2\right) + n\bar{x}^2 - 2n\bar{x}^2
       = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
       = \sum x_i^2 - \frac{\left[\sum x_i\right]^2}{n}.

The last expression is the best for calculation purposes as it avoids rounding error.

Similarly, we can define the deviance of y as S_{yy} = \sum (y_i - \bar{y})^2. Also the codeviance of x and y can be computed as

S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}.

Note also that

\sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = \sum x_i - \sum x_i = 0.

This result comes up a number of times and you should certainly know it.

It follows that

S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x_i - \bar{x})y_i - \sum (x_i - \bar{x})\bar{y} = \sum (x_i - \bar{x})y_i,

since \sum (x_i - \bar{x})\bar{y} = \bar{y}\sum (x_i - \bar{x}) = 0.

2.1 The Model

We start with the simplest situation where we have one response variable Y and one explanatory variable X.

In many practical situations we deal with an explanatory variable X that can be controlled (known) and a response variable Y which can be observed (unknown).

We want to predict (or estimate) the mean value of Y for given values of X, working from a sample of n pairs of observations

{(x1, y1), (x2, y2), . . . , (xn, yn)}.

Example 2.1. Claims and Payments. A sample of ten claims and corresponding payments on settlement for household policies is taken from the business of an insurance company. The data were collected from 10 claims and payments (in units of 100 pounds) and are shown in the table below.

Payments may vary for different claims. The claim, X, is known exactly and is not random, but we may assume that Y is random, so that repeated observations of Y for the same values of X will vary.

We find \sum x_i = 35.4, \sum y_i = 32.87, \sum x_i^2 = 133.76, \sum y_i^2 = 115.2025 and \sum x_i y_i = 123.81.

A useful initial stage of modelling is to plot the data.

Figure 2.4 shows the plot of the payments against claims.


X: Claim   Y: Payment
2.10       2.18
2.40       2.06
2.50       2.54
3.20       2.61
3.60       3.67
3.80       3.25
4.10       4.02
4.20       3.71
4.50       4.38
5.00       4.45

Figure 2.4: Plot of the payments against the claims.

The plot suggests that the payment and the claim might be linearly related, although we would not expect the payment to keep increasing linearly over all possible claims. In this example the linear relationship can be considered increasing with the size of the claim. □

Other types of function could also describe the relationship well, for example a quadratic polynomial with a very small second order coefficient. However, it is better to use the simplest model which describes the relationship well. This is called the principle of parsimony.

What does it mean “to describe the relationship well”?

We will be working on this problem throughout the course.

If we fit a straight line for the payment variable (Example 2.1), then a deterministic model is

Y = β0 + β1X,

where β0 denotes the intercept and β1 the slope of the line.


This, however, does not adequately describe the data, which show some randomness. To deal with this problem we introduce a probabilistic element to our model in such a way that every value of the response variable Y consists of two parts:

• one is what we expect to observe for a given value of X ,

• the other is an uncontrolled random value.

Then we can write

Yi = E(Yi|X = xi) + εi, where i = 1, 2, . . . , n.

Hence, here we have

Yi = β0 + β1xi + εi, where i = 1, 2, . . . , n.

We call εi a random error. Standard assumptions about the error are

1. E(εi) = 0 for all i = 1, 2, . . . , n,

2. var(εi) = σ2 for all i = 1, 2, . . . , n,

3. cov(εi, εj) = 0 for all i, j = 1, 2, . . . , n, i ≠ j.

The errors are often called departures from the mean. Note that εi is a random variable, hence Yi is a random variable and the assumptions can be rewritten as

1. E(Yi|X = xi) = µi = β0 + β1xi for all i = 1, . . . , n,

2. var(Yi|X = xi) = σ2 for all i = 1, . . . , n,

3. cov(Yi|X = xi, Yj|X = xj) = 0 for all i, j = 1, . . . , n, i ≠ j.

It means that the dependence of Y on X is linear, the variance of the response Y at each value of X is constant (does not depend on xi), and Yi and Yj are uncorrelated.

Also, it is often assumed that the conditional distribution of Yi is normal. Then, due to the assumption (3) on the covariances, the variables Yi are independent. This is written as

Y_i \mid X = x_i \overset{ind}{\sim} N(\mu_i, \sigma^2).

The graph in Figure 2.5 summarizes all the model assumptions.

For simplicity of notation we define

yi := Yi|X = xi. (2.1)

Then the simple linear model can be written as

E(yi) = β0 + β1xi,

var(yi) = σ2.

If we assume normality, we have the so-called Normal Simple Linear Regression Model, denoted in one of the equivalent ways:


Figure 2.5: Model assumptions about the randomness of observations.

• y_i \overset{ind}{\sim} N(\mu_i, \sigma^2), where \mu_i = \beta_0 + \beta_1 x_i, i = 1, 2, \ldots, n,

• y_i \overset{ind}{\sim} N(\beta_0 + \beta_1 x_i, \sigma^2), i = 1, 2, \ldots, n,

• y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, where \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2), i = 1, 2, \ldots, n.

For mathematical manipulation it is often convenient to reparameterize the model as follows.

y_i = \beta_0 + \beta_1 x_i - \beta_1\bar{x} + \beta_1\bar{x} + \varepsilon_i
    = (\beta_0 + \beta_1\bar{x}) + \beta_1(x_i - \bar{x}) + \varepsilon_i
    = \alpha + \beta(x_i - \bar{x}) + \varepsilon_i,

where \alpha = \beta_0 + \beta_1\bar{x}, \beta = \beta_1 and \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

In this centered form of the model the slope parameter is identical to the one in the non-centered model. The intercept has the interpretation of the mean response at the mean level of the explanatory variable.
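In R the centered form can be fitted directly; a minimal sketch, assuming the x and y vectors from Example 1.1 (the slope is unchanged and the intercept becomes the mean response at x = x̄):

> reg_c <- lm(y ~ I(x - mean(x)))
> coef(reg_c)   # intercept equals mean(y); slope equals the slope of the uncentered fit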

2.2 Least Squares Estimation

Estimation is a method of finding values of the unknown model parameters for a given data set so that the model fits the data in a "best" way. There are various estimation methods, depending on how we define "best". In this section we consider the Method of Least Squares (LS or LSE).

The LS estimators of the model parameters β0 and β1 minimize the sum of squares of errors, that is

S(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2. \qquad (2.2)


The “best” here means the smallest value of S(β0, β1). S is a function of the parameters and so to find its minimum we differentiate it with respect to β0 and β1, then equate the derivatives to zero:

\begin{cases}
\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)] = 0 \\[4pt]
\frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]\, x_i = 0
\end{cases} \qquad (2.3)

This set of equations can be written as

\begin{cases}
n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \\[4pt]
\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
\end{cases} \qquad (2.4)

These are called the normal equations. Dividing the first equation by n gives the solution

\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_1 \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{y} - \hat{\beta}_1\bar{x} \qquad (2.5)

and, from the second normal equation

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}
             = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
             = \frac{S_{xy}}{S_{xx}}. \qquad (2.6)

To check that S(β0, β1) attains a minimum at (\hat{\beta}_0, \hat{\beta}_1) we calculate second derivatives and evaluate the determinant

\begin{vmatrix}
\frac{\partial^2 S}{\partial \beta_0^2} & \frac{\partial^2 S}{\partial \beta_0 \partial \beta_1} \\[4pt]
\frac{\partial^2 S}{\partial \beta_1 \partial \beta_0} & \frac{\partial^2 S}{\partial \beta_1^2}
\end{vmatrix}
=
\begin{vmatrix}
2n & 2\sum_{i=1}^{n} x_i \\[4pt]
2\sum_{i=1}^{n} x_i & 2\sum_{i=1}^{n} x_i^2
\end{vmatrix}
= 4n \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0

for all β0, β1. Also, \frac{\partial^2 S}{\partial \beta_0^2} > 0 for all β0, β1. This means that the function S(β0, β1) attains a minimum at (\hat{\beta}_0, \hat{\beta}_1) given by (2.5) and (2.6).

Remark 2.1. Note that the estimators depend on the variable Y; they are functions of Y, which is a random variable, and so the estimators of the model parameters are random variables too. When we calculate the values of the estimators for a given data set, i.e. for observed values of Y at given values of X, we obtain so-called estimates of the parameters. We may obtain different estimates of β0 and β1 calculated for different data sets fitted by the same kind of model. □


Example 2.2. (Payments cont.) For the given data in Example 2.1 we obtain

\sum_{i=1}^{10} y_i = 32.87, \quad \sum_{i=1}^{10} x_i = 35.4, \quad \sum_{i=1}^{10} x_i y_i = 123.81, \quad \sum_{i=1}^{10} x_i^2 = 133.76,

and

\bar{y} = \frac{1}{10} \cdot 32.87 = 3.287, \qquad \bar{x} = \frac{1}{10} \cdot 35.4 = 3.54.

Hence, the estimates of the model parameters are

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{123.81 - 10 \cdot 3.54 \cdot 3.287}{133.76 - 10 \cdot 3.54^2} = 0.88231,

\hat{\beta}_0 = 3.287 - 0.88231 \cdot 3.54 = 0.164,

and the estimated (fitted) linear model is

yi = 0.164 + 0.88231xi.

From this fitted model we may calculate values of the payments for any claim within the observed claim range. For example, we may estimate the expected payment on settlement for a claim of 350 pounds (since we are working in units of 100 pounds, a claim of 350 pounds corresponds to x = 3.5) as

yi = 0.164 + 0.88231 · 3.5 = 3.25.

Thus we would expect the settlement payment to be 325 pounds. □
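The hand calculation can be checked numerically; a short sketch using only the summary statistics quoted above:

> Sxy <- 123.81 - 10 * 3.54 * 3.287
> Sxx <- 133.76 - 10 * 3.54^2
> b1  <- Sxy / Sxx           # 0.88231
> b0  <- 3.287 - b1 * 3.54   # 0.164
> b0 + b1 * 3.5              # estimated payment for a claim of 350 pounds, about 3.25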

Remark 2.2. Two special cases of the simple linear model are available

• no-intercept model

yi = β1xi + εi,

which implies that E(Yi|X = 0) = 0, and

• constant model

yi = β0 + εi,

which implies that the response variable Y does not depend on the explanatory variable X. □


2.3 Properties of the Estimators

Definition 2.1. If θ̂ is an estimator of θ and E[θ̂] = θ, then we say θ̂ is unbiased for θ.

Note that in this definition θ̂ is a random variable. We must distinguish between θ̂ when it is an estimate and when it is an estimator. As a function of the observed values yi it is an estimate; as a function of the random variables Yi it is an estimator.

The parameter estimator \hat{\beta}_1 can be written as

\hat{\beta}_1 = \sum_{i=1}^{n} c_i Y_i, \quad \text{where } c_i = \frac{x_i - \bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{x_i - \bar{x}}{S_{xx}}. \qquad (2.7)

We have assumed that Y1, Y2, . . . , Yn are normally distributed and hence, using the result that any linear combination of normal random variables is also a normal random variable, \hat{\beta}_1 is also normally distributed. This is derived from the following theorem, which you studied in Probability and Statistics II:

Theorem 2.1. If Y_i \sim N(\mu_i, \sigma_i^2) and the Y_i are independent, then

\sum_{i=1}^{n} a_i Y_i \sim N\left(\sum_{i=1}^{n} a_i \mu_i, \; \sum_{i=1}^{n} a_i^2 \sigma_i^2\right).

Recall the model Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i with \varepsilon_i \sim N(0, \sigma^2); then

Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2).

We now derive the mean and variance of \hat{\beta}_1 using the representation (2.7).

E[\hat{\beta}_1] = E\left[\sum_{i=1}^{n} c_i Y_i\right] = \sum_{i=1}^{n} c_i E[Y_i] = \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i,

but \sum c_i = 0 and \sum c_i x_i = 1, since \sum (x_i - \bar{x})x_i = S_{xx}, so E[\hat{\beta}_1] = \beta_1. Thus \hat{\beta}_1 is unbiased for \beta_1. Moving to the variance,

var[\hat{\beta}_1] = var\left[\sum_{i=1}^{n} c_i Y_i\right]
= \sum_{i=1}^{n} c_i^2 \, var[Y_i] \quad \text{(since the } Y_i \text{ are independent)}
= \sum_{i=1}^{n} \sigma^2 (x_i - \bar{x})^2 / S_{xx}^2
= \sigma^2 / S_{xx}.


Hence, following Theorem 2.1, we have

\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right).

Moving to the intercept estimator, note that

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x}
             = \frac{1}{n}\sum_{i=1}^{n} Y_i - \bar{x}\sum_{i=1}^{n} c_i Y_i
             = \sum_{i=1}^{n} Y_i \left(\frac{1}{n} - c_i\bar{x}\right),

where c_i = (x_i - \bar{x})/S_{xx}. Thus the parameter estimator \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x} is also a linear combination of the Y_i's and hence \hat{\beta}_0 is normally distributed.

We can also find the mean and variance of \hat{\beta}_0:

E[\hat{\beta}_0] = E[\bar{Y}] - E[\hat{\beta}_1\bar{x}]
                = \beta_0 + \beta_1\bar{x} - \bar{x}E[\hat{\beta}_1]
                = \beta_0 + \beta_1\bar{x} - \beta_1\bar{x}
                = \beta_0.

Thus \hat{\beta}_0 is also unbiased. Its variance is given by

var[\hat{\beta}_0] = var\left[\sum_{i=1}^{n} b_i Y_i\right], \quad \text{where } b_i = 1/n - c_i\bar{x},

thus

var[\hat{\beta}_0] = \sum_{i=1}^{n} b_i^2 \, var[Y_i]
= \sum_{i=1}^{n} \sigma^2 b_i^2
= \sigma^2 \sum_{i=1}^{n} \left(1/n^2 - 2c_i\bar{x}/n + c_i^2\bar{x}^2\right)
= \sigma^2\left[1/n - 0 + \sum_{i=1}^{n} (x_i - \bar{x})^2\bar{x}^2 / S_{xx}^2\right]
= \sigma^2\left[1/n + \bar{x}^2/S_{xx}\right].

So we have

\hat{\beta}_0 \sim N\left(\beta_0, \; \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]\right).


2.4 Assessing the Model

2.4.1 Analysis of Variance Table

Parameter estimates obtained for the model

yi = β0 + β1xi + εi

can be used to estimate the mean response corresponding to each variable Yi, that is,

\hat{E}(Y_i|X = x_i) = \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \quad i = 1, \ldots, n.

For a given data set (xi, yi), the quantities ŷ1, . . . , ŷn are called fitted values and are points on the fitted regression line corresponding to the values of xi.

Figure 2.6: Observations and fitted line for the settlement payment data.

The observed values yi usually do not fall exactly on the line and so are not equal to the fitted values ŷi, as is shown in Figure 2.6.

The residuals (also called crude residuals) are defined as

e_i := y_i - \hat{y}_i, \quad i = 1, \ldots, n, \qquad (2.8)

where yi is the observed value and ŷi is the fitted value. These are estimates of the random errors εi and they satisfy

e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) = y_i - \bar{y} - \hat{\beta}_1(x_i - \bar{x})

and

\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x}) = 0. \qquad (2.9)


Also note that the quantity S, which we minimized to obtain the least squares estimators of the model parameters, is defined as

S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2,

where S is a function of β0 and β1 and has a quadratic form. When evaluated for a given data set at the least squares estimates of β0 and β1, it is called the Residual Sum of Squares and is denoted by SSE, that is,

SS_E = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2. \qquad (2.10)

This, for a given data set, is the minimum value of the function S(β0, β1) and it measures how closely the model fits the data. It measures the variability in the data around the fitted model.

Consider the constant model y_i = \beta_0 + \varepsilon_i.

For this model \hat{y}_i = \hat{\beta}_0 = \bar{y} and, for a given data set, we have the fitted values and the residuals as

\hat{y}_i = \bar{y}, \qquad e_i = y_i - \hat{y}_i = y_i - \bar{y},

and

SS_E = SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2.

This measures the total variation in y around its mean; it is called the Total Sum of Squares and is denoted by SST. For a constant model SSE = SST. When the model is non-constant, i.e. there is a significant slope, the difference yi − ȳ can be split into two components: one due to the regression model fit and one due to the residuals, that is

y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}),

see Figure 2.7.

The following theorem gives such an identity for the respective sums of squares.

Theorem 2.2. Analysis of Variance Identity. In the simple linear regression model the total sum of squares is a sum of the regression sum of squares and the residual sum of squares, that is

SST = SSR + SSE, (2.11)

where

SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.


Figure 2.7: Observations (+), fitted line (purple dashed line), fitted values (red dots) for the non-constant model and the fitted line (blue dashed line) for a constant model.

Proof.

SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2
     = \sum_{i=1}^{n} [(y_i - \hat{y}_i)^2 + (\hat{y}_i - \bar{y})^2 + 2(y_i - \hat{y}_i)(\hat{y}_i - \bar{y})]
     = SS_E + SS_R + 2A,


where

A = \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
  = \sum_{i=1}^{n} (y_i - \hat{y}_i)\hat{y}_i - \bar{y}\sum_{i=1}^{n} (y_i - \hat{y}_i)
  = \sum_{i=1}^{n} e_i\hat{y}_i - \bar{y}\underbrace{\sum_{i=1}^{n} e_i}_{=0 \text{ by } (2.3)}
  = \sum_{i=1}^{n} e_i(\hat{\beta}_0 + \hat{\beta}_1 x_i)
  = \hat{\beta}_0 \underbrace{\sum_{i=1}^{n} e_i}_{=0 \text{ by } (2.3)} + \hat{\beta}_1 \underbrace{\sum_{i=1}^{n} e_i x_i}_{=0 \text{ by } (2.3)}.

Hence A = 0. □

The model (regression) fit sum of squares, SSR, represents the variability in the yi accounted for by the fitted model; the residual sum of squares, SSE, represents the variability in the yi accounted for by the differences between the observations and the fitted values.

The Analysis of Variance (ANOVA) Table (see Table 2.2) shows the sources of variation, the sums of squares and the statistics based on the sums of squares, and allows us to test the significance of the regression slope.

Source of variation   d.f.         SS     MS                 VR
Regression            νR = 1       SSR    MSR = SSR / νR     F = MSR / MSE
Residual              νE = n − 2   SSE    MSE = SSE / νE
Total                 νT = n − 1   SST

Table 2.2: ANOVA Table

Looking at Table 2.2, the “d.f.” is short for “degrees of freedom”.

What are degrees of freedom?

For an intuitive explanation consider the observations y1, y2, . . . , yn and assume that their sum is fixed, say equal to a, that is

y1 + y2 + . . .+ yn = a.

For a fixed value of the sum a there are n − 1 arbitrary y-values, but one y-value is determined by the difference of a and the n − 1 arbitrary y-values. This one value is not free; it depends on the other y-values and on a. We say that there are n − 1 independent (free to vary) pieces


of information and one piece is taken up by a.

Estimates of parameters can be based upon different amounts of information. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom. This is why in order to calculate

SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2

we have n − 1 free-to-vary pieces of information from the collected data, that is we have n − 1 degrees of freedom. The one degree of freedom is taken up by ȳ. Similarly, for

SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2

we have two degrees of freedom taken up: one by \hat{\beta}_0 and one by \hat{\beta}_1 (both depend on y1, y2, . . . , yn). Hence, there are n − 2 independent pieces of information to calculate SSE.

Finally, as SSR = SST − SSE, we can calculate the d.f. for SSR as a difference between the d.f. for SST and for SSE, that is νR = (n − 1) − (n − 2) = 1.

The ANOVA table (Table 2.2) also includes the so-called Mean Squares (MS), which can be thought of as measures of average variation and are calculated by dividing the sum of squares (SS) by the degrees of freedom (d.f.).

The last column of the table contains the Variance Ratio (VR), which is the ratio of the regression mean square to the error mean square, thus

F = \frac{MS_R}{MS_E}

and it measures the variation explained by the model fit relative to the variation due to residuals.

2.4.2 F test

Recall that the F-distribution is a skewed distribution which depends on two parameters: ν1 and ν2 degrees of freedom. If X ∼ χ²(ν1) and Y ∼ χ²(ν2) are two independent random variables with Chi-squared distributions, then

F = \frac{X/\nu_1}{Y/\nu_2}

has an F-distribution with ν1 (for the numerator) and ν2 (for the denominator) degrees of freedom. The F-distribution, or "Fisher's F distribution", is written as

F^{\nu_1}_{\nu_2}, \quad F_{\nu_1, \nu_2} \quad \text{or} \quad F(\nu_1, \nu_2).


We will see later that, if β1 = 0, then

F = \frac{MS_R}{MS_E} \sim F^1_{n-2}.

Thus, to test the null hypothesis H0 : β1 = 0

versus the alternative H1 : β1 ≠ 0,

we use the variance ratio F as the test statistic. Under H0 the ratio has an F distribution with 1 and n − 2 degrees of freedom.

We reject H0 at a significance level α if

F_{cal} > F^1_{n-2}(\alpha),

where F_{cal} denotes the value of the variance ratio F calculated for a given data set and F^1_{n-2}(\alpha) is such that

P\left(F > F^1_{n-2}(\alpha)\right) = \alpha.

There is no evidence to reject H0 if F_{cal} < F^1_{n-2}(\alpha).

Rejecting H0 means that the slope β1 ≠ 0 and the full regression model

yi = β0 + β1xi + εi

is better than the constant model y_i = \beta_0 + \varepsilon_i.
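In R the critical value F^1_{n-2}(α) and the p-value of an observed variance ratio are available from qf() and pf(); a small sketch for the widgets fit of Example 1.1 (n = 10, F_cal = 1813 from the summary output):

> qf(0.95, df1 = 1, df2 = 8)                       # critical value at alpha = 0.05
> pf(1813, df1 = 1, df2 = 8, lower.tail = FALSE)   # p-value, about 1e-10 as in the output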

2.4.3 Estimating σ2

Note that the sums of squares in the ANOVA table (see Table 2.2) are functions of yi, i = 1, . . . , n, which are random variables Yi|X = xi. Hence, the sums of squares are random variables as well. This fact allows us to check some stochastic properties of the sums of squares, such as their expectation, variance and distribution.

Theorem 2.3. In the full simple linear regression model we have

E(SSE) = (n− 2)σ2

Proof. The proof will be given later. □

From Theorem 2.3, we obtain

E(MS_E) = E\left(\frac{1}{n-2} SS_E\right) = \sigma^2,


Figure 2.8: F distribution; the shaded region corresponds to F-values greater than or equal to the observed F-value, i.e. the probability of such values when H0 is true.

thus MSE is an unbiased estimator of σ2. It is often denoted by S2.

Notice that in the full model S² is not the sample variance. We have

S^2 = MS_E = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{\mu}_i)^2.

It is the sample variance in the constant (null) model, where \hat{y}_i = \bar{y} and νE = n − 1. Then

S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2.

2.4.4 Coefficient of Determination

The coefficient of determination, denoted by R², is the percentage of total variation in yi explained by the fitted model, that is

R^2 = \frac{SS_R}{SS_T}\,100\% = \frac{SS_T - SS_E}{SS_T}\,100\% = \left(1 - \frac{SS_E}{SS_T}\right)100\%. \qquad (2.12)

Note:


• R2 ∈ [0, 100].

• R² = 0 indicates that none of the variability in the data (y) is explained by the regression model.

• R2 = 100 indicates that SSE = 0 and all observations fall on the fitted line exactly.

R² is a measure of the linear association between Y and X. A small R² does not always imply a poor relationship between Y and X, which may, for example, follow a quadratic model.
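Equation (2.12) can be verified from a fitted model; a small sketch, again assuming an lm object fit and the response vector y:

> SSE <- sum(residuals(fit)^2)
> SST <- sum((y - mean(y))^2)
> 1 - SSE / SST            # coefficient of determination (as a proportion)
> summary(fit)$r.squared   # the value reported by R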

2.5 Fitted Values and Residuals

We begin this section by recapping information about the fitted values and residuals.

The fitted values \hat{y}_i, for i = 1, 2, \ldots, n, are the estimated values of E[Y_i] = \beta_0 + \beta_1 x_i, so

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.

The residuals (also called crude residuals) are defined as e_i = Y_i - \hat{y}_i, thus

e_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) = Y_i - \bar{Y} - \hat{\beta}_1(x_i - \bar{x}).

2.5.1 Properties of Residuals

Recall that the sum of the residuals is

\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (Y_i - \bar{Y}) - \hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x}) = 0.

Since the residuals, under the model, are functions of the random variables Yi, we can compute the mean of the ith residual as

E[e_i] = E[Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i]
       = E[Y_i] - E[\hat{\beta}_0] - x_i E[\hat{\beta}_1]
       = \beta_0 + \beta_1 x_i - \beta_0 - \beta_1 x_i
       = 0.

The variance of the residuals is given by

var[e_i] = \sigma^2\left[1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}\right],

which can be shown by writing ei as a linear combination of the Yi's. Note that it depends on i, thus the variance of ei is not quite constant, unlike that of εi.


Similarly it can be shown that the covariance of two residuals ei and ej is

cov[e_i, e_j] = -\sigma^2\left[\frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}\right],

which is in contrast with the results for the error terms, where var[εi] = σ² and cov[εi, εj] = 0. So the crude residuals ei do not quite mimic the properties of εi. In conclusion, the error term, εi, comes from the model, while the residual, ei, arises after fitting the model to the data.

2.5.2 Standardised residuals

Definition 2.2. We define the standardised residual

d_i = \frac{e_i}{[s^2(1 - v_i)]^{1/2}},

where

v_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}.

The standardised residuals have a more nearly constant variance and a smaller covariance than the crude residuals.


2.5.3 Residual plots

The shape of various residual plots can show if the normality assumptions are approximately met, but they can also reveal if the linear model is a good fit.

To check linearity, we plot di against xi, as shown in Figure 2.9.

Figure 2.9: Plots of di against xi: (a) no problem apparent; (b) clear non-linearity.

To check constancy of variance (homoscedasticity), we plot di against the fitted values ŷi, as shown in Figure 2.10.

Figure 2.10: (a) No problem apparent; (b) variance increases as the mean response increases.


To check the assumption of normality we use a normal QQ plot. If the data are from a normal distribution the plot should be close to a straight line. Other distributions will tend to have points away from the line. Figures 2.11 and 2.12 show some examples of data simulated from various distributions with the resulting QQ plots. Although the QQ plot can often give a good indication of whether the distribution of the errors is normal, it is usually best to formally test that hypothesis. We shall see how to do that later.
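The three checks described above can be produced in R from the standardised residuals; a minimal sketch for a fitted model object fit with regressor vector x:

> d <- rstandard(fit)
> plot(x, d)                 # check linearity
> plot(fitted(fit), d)       # check constancy of variance
> qqnorm(d); qqline(d)       # check normality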

Figure 2.11: Normal QQ plot

2.5.4 Settlement Payment example using R

Example 2.3. The R code and output is as follows

> Claim <- c(2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00)
> Payment <- c(2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45)
> model <- lm(Payment ~ Claim)
> summary(model)

Call:
lm(formula = Payment ~ Claim)


Figure 2.12: Normal QQ plot

Residuals:
     Min       1Q   Median       3Q      Max
-0.37702 -0.20571  0.01918  0.22183  0.33006

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.16363    0.34048   0.481    0.644
Claim        0.88231    0.09309   9.478 1.27e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2705 on 8 degrees of freedom
Multiple R-squared: 0.9182,    Adjusted R-squared: 0.908
F-statistic: 89.82 on 1 and 8 DF,  p-value: 1.265e-05

> anova(model)


Analysis of Variance Table

Response: Payment
          Df Sum Sq Mean Sq F value    Pr(>F)
Claim      1 6.5734  6.5734  89.824 1.265e-05 ***
Residuals  8 0.5854  0.0732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Figure 2.13: Fitted line plot for the Settlement Payment data.

Comments:
We fitted a simple linear model of the form

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, 10, \quad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2).

The estimated values of the parameters are

- intercept: \hat{\beta}_0 ≈ 0.1636

- slope: \hat{\beta}_1 ≈ 0.88231

Only the slope parameter is significant (p < 0.001), while the intercept is not significant.

The ANOVA table also shows the significance of the regression (slope), that is the null hypothesis

H0 : β1 = 0


versus the alternative H1 : β1 ≠ 0

can be rejected at the significance level α < 0.001.

The tests require the assumption of normality of the random errors. It should be checked whether such an assumption is approximately fulfilled. If it is not, the tests may not be valid (various methods for checking the assumption will be discussed in the lectures later on).

The value of R² is very high, i.e., R² = 91.82%. It means that the fitted model explains the variability in the observations very well.

The graph shows that the observations lie along the fitted line and there are no individual unusual points which are far from the line or which could strongly affect the slope.

Final conclusions:
We can conclude that the data indicate that the settlement payment depends linearly on the claim (within the range 2–5). The mean increase in the settlement payment per unit claim is estimated as \hat{\beta}_1 ≈ 0.88.

However, it might be wrong to predict the settlement payment, or its increase per claim, outside the range of the observed claims. We would expect the growth to slow down for larger claims, so that the relationship becomes non-linear. □

2.6 Inference about the regression parameters

Note that for the simple linear regression model

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \text{where } \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2), \qquad (2.13)

we obtained the following LSE of the parameters β0 and β1:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.

We now derive results which allow us to make inferences about the regression parameters and predictions.

2.6.1 Inference about β1

We proved the following result in Section 2.3.

Theorem 2.4. In the model (2.13) the sampling distribution of the LSE \hat{\beta}_1 of β1 is normal with expectation E(\hat{\beta}_1) = \beta_1 and variance var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, that is

\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right). \qquad (2.14)


Remark 2.3. For large samples, where there is no assumption of normality of yi, the sampling distribution of \hat{\beta}_1 is approximately normal.

Theorem 2.4 allows us to derive a confidence interval (CI) and a test of non-significance for β1. After standardisation of \hat{\beta}_1 we obtain

\frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0, 1).

However, the error variance is usually not known and is replaced by its estimate. Then the normal distribution changes to a Student t-distribution. The explanation is as follows.

Lemma 2.1. If Z \sim N(0, 1) and U \sim \chi^2_\nu, and Z and U are independent, then

\frac{Z}{\sqrt{U/\nu}} \sim t_\nu.

Here we have

Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0, 1).

We will see later that

U = \frac{(n-2)S^2}{\sigma^2} \sim \chi^2_{n-2}

and that S² and \hat{\beta}_1 are independent. It follows that

T = \frac{\dfrac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}}}{\sqrt{\dfrac{(n-2)S^2}{\sigma^2(n-2)}}} = \frac{\hat{\beta}_1 - \beta_1}{S/\sqrt{S_{xx}}} \sim t_{n-2}. \qquad (2.15)

Confidence interval for β1

To find a CI for an unknown parameter θ means to find boundaries a and b such that

P (a < θ < b) = 1− α

for some small α, that is for a high confidence level (1 − α)100%. The boundaries will depend on the data and so using the parameter estimates is a natural way of finding the CI. From Equation (2.15) we have

P\left(-t_{\alpha/2} < \frac{\hat{\beta}_1 - \beta_1}{S/\sqrt{S_{xx}}} < t_{\alpha/2}\right) = 1 - \alpha, \qquad (2.16)

where t_{\alpha/2} is such that P(|t_\nu| < t_{\alpha/2}) = 1 - \alpha.

Rearranging the expression in brackets of Equation (2.16) gives

P\left(\hat{\beta}_1 - t_{\alpha/2}\frac{S}{\sqrt{S_{xx}}} < \beta_1 < \hat{\beta}_1 + t_{\alpha/2}\frac{S}{\sqrt{S_{xx}}}\right) = 1 - \alpha. \qquad (2.17)


This is a probability statement about the random variables \hat{\beta}_1 and S. When we replace them by their values from the observed data we obtain the CI for β1 as

[a, b] = \left[\hat{\beta}_1 - t_{\alpha/2}\frac{S}{\sqrt{S_{xx}}}, \; \hat{\beta}_1 + t_{\alpha/2}\frac{S}{\sqrt{S_{xx}}}\right]. \qquad (2.18)

Example 2.4. Overheads. A company builds custom electronic instruments and computer components. All jobs are manufactured to customer specifications. The firm wants to be able to estimate its overhead cost. As part of a preliminary investigation, the firm decides to focus on a particular department and investigates the relationship between total departmental overhead cost (y) and total direct labour hours (x). The data for the most recent 16 months are plotted in Figure 2.14.

Figure 2.14: Plot of overheads data.

Two objectives of this investigation are

1. Summarize for management the relationship between total departmental overhead and total direct labour hours.

2. Estimate the expected and predict the actual total departmental overhead from the total direct labour hours.

The regression equation is
y = 16310 + 11.0 x

Predictor   Coef     SE Coef   T      P
Constant    16310    2421      6.74   0.000
x           10.982   2.268     4.84   0.000

S = 1645.61   R-Sq = 62.6%   R-Sq(adj) = 60.0%

Analysis of Variance
Source          DF   SS          MS         F       P
Regression      1    63517077    63517077   23.46   0.000
Residual Error  14   37912232    2708017
Total           15   101429309

Comments:

• The model fit is ŷi = 16310 + 11xi. There is a significant relationship between the overheads and the labour hours (p < 0.001 in ANOVA).

• An increase of one labour hour will increase the mean overheads by about £11 (\hat{\beta}_1 = 11.0).

• There is rather large variability in the data; the percentage of total variation explained by the model is rather small (R² = 62.6%). It might be worth considering other factors which may also influence the costs (and then include them in the model).

The model allows us to estimate the total overhead cost, but as we noticed, there is large variability in the data. In such a case, point estimates may not be very reliable. In any case, point estimates should always be accompanied by their standard errors. Then we can also find confidence intervals (CI) for the unknown model parameters, or test their non-significance.

The values needed for the CI for the overhead costs in Example 2.4 are the following:

\hat{\beta}_1 = 10.982, \quad S = 1645.61, \quad S_{xx} = 526656.9, \quad t_{14, 0.025} = 2.14479.

Hence, the 95% CI for β1 is

\left[10.982 - 2.14479\,\frac{1645.61}{\sqrt{526656.9}}, \; 10.982 + 2.14479\,\frac{1645.61}{\sqrt{526656.9}}\right] = [6.11851, 15.8455].

Note that this means we would not reject any null hypothesis that β1 took any of these values at the 5% significance level. □
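The numbers above can be reproduced directly from the quoted summary statistics:

> t14 <- qt(0.975, df = 14)                             # 2.14479
> 10.982 + c(-1, 1) * t14 * 1645.61 / sqrt(526656.9)    # [6.11851, 15.8455]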

Test of H0 : β1 = 0

The null hypothesis H0 : β1 = 0 means that the slope is zero and the true model is a constant model

yi = β0 + εi,

i.e., there is no relationship between Y and X. From (2.15) we see that if H0 is true, then the statistic

T = \frac{\hat{\beta}_1}{S/\sqrt{S_{xx}}} \sim t_{n-2} \qquad (2.19)

can be used as a test function for this null hypothesis.

We reject H0 at a significance level α when the value of the test statistic calculated for a given data set, T_{cal}, is in the rejection region, that is

|T_{cal}| > t_{n-2, \alpha/2}.


Most statistical software gives the p-value when testing a hypothesis. When the p-value is smaller than α we may reject the null hypothesis at a significance level ≤ α.

We could similarly carry out hypothesis tests that β1 took any particular value.

Remark 2.4. The square root of the variance var(\hat{\beta}_1) is called the standard error of \hat{\beta}_1, that is

se(\hat{\beta}_1) = \sqrt{\frac{\sigma^2}{S_{xx}}}.

It is estimated by

\widehat{se}(\hat{\beta}_1) = \sqrt{\frac{S^2}{S_{xx}}}.

Often this estimated standard error is itself called the standard error. This is not strictly correct and you should be aware of the difference between the two.

Remark 2.5. Note that the (1 − α)100% CI for β1 can be written as

\left[\hat{\beta}_1 - t_{n-2,\alpha/2}\,\widehat{se}(\hat{\beta}_1), \; \hat{\beta}_1 + t_{n-2,\alpha/2}\,\widehat{se}(\hat{\beta}_1)\right]

and the test statistic for H0 : β1 = 0 as

T = \frac{\hat{\beta}_1}{\widehat{se}(\hat{\beta}_1)}.

As we have noted before, we can also test the hypothesis H0 : β1 = 0 using the Analysis of Variance table and an F test. The two tests are actually equivalent since, if the random variable W \sim t_\nu, then W^2 \sim F^1_\nu.

2.6.2 Inference about β0

Since we are studying the relationship between X and Y, we are most interested in β1. However, we can also carry out inference about β0. The LSE of β0 is

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.

We have proved the following theorem in Section 2.3.

Theorem 2.5. In the SLM the sampling distribution of the LSE of β0 is normal with expectation E(\hat{\beta}_0) = \beta_0 and variance var(\hat{\beta}_0) = \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\sigma^2, that is

\hat{\beta}_0 \sim N\left(\beta_0, \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\sigma^2\right). \qquad (2.20)

Corollary 2.1. Assuming the full simple linear regression model, we obtain


CI for β0:

\left[\hat{\beta}_0 - t_{n-2,\alpha/2}\,\widehat{se}(\hat{\beta}_0), \; \hat{\beta}_0 + t_{n-2,\alpha/2}\,\widehat{se}(\hat{\beta}_0)\right]

Test of the hypothesis H0 : β0 = β00:

T = \frac{\hat{\beta}_0 - \beta_{00}}{\widehat{se}(\hat{\beta}_0)} \sim t_{n-2} \text{ under } H_0,

where

\widehat{se}(\hat{\beta}_0) = \sqrt{S^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}.

The calculated values for the overhead costs of Example 2.4 are as follows:

\hat{\beta}_0 = 16310, \qquad \widehat{se}(\hat{\beta}_0) = 2421.

Hence, the 95% CI for β0 is

[16310− 2.14479× 2421, 16310 + 2.14479× 2421]

= [11117.5, 21502.5].

2.6.3 Inference about E(Yi|X = xi)

In the simple linear regression model, we have

µi = E(Yi|X = xi) = β0 + β1xi

and its LSE is \hat{\mu}_i = \hat{E}(Y_i|X = x_i) = \hat{\beta}_0 + \hat{\beta}_1 x_i.

We may estimate the mean response at any value of X which is within the range of the data, say x0. Then,

\hat{\mu}_0 = \hat{E}(Y_0|X = x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0.

As for the LSEs of β0 and β1, we have the following theorem.

Theorem 2.6. In the SLM the sampling distribution of the LSE of µ0 is normal with expectation E(\hat{\mu}_0) = \mu_0 and variance var(\hat{\mu}_0) = \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)\sigma^2, that is

\hat{\mu}_0 \sim N\left(\mu_0, \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)\sigma^2\right). \qquad (2.21)

Corollary 2.2. Assuming the full simple linear regression model, we obtain


CI for µ0:

\left[\hat{\mu}_0 - t_{n-2,\alpha/2}\,\widehat{se}(\hat{\mu}_0), \; \hat{\mu}_0 + t_{n-2,\alpha/2}\,\widehat{se}(\hat{\mu}_0)\right]

Test of the hypothesis H0 : µ0 = µ*:

T = \frac{\hat{\mu}_0 - \mu^*}{\widehat{se}(\hat{\mu}_0)} \sim t_{n-2} \text{ under } H_0,

where

\widehat{se}(\hat{\mu}_0) = \sqrt{S^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}.

Remark 2.6. Care is needed when estimating the mean at x0. It should only be done if x0 is within the data range. Extrapolation beyond the range of the given x-values is not reliable, as there is no evidence that a linear relationship is appropriate there.

2.6.4 Prediction Interval for a new observation

Apart from making inference on the mean response we may also try to do it for a new response itself, that is for an unknown (not observed) response at some x0. For example, we might want to predict an overhead cost for a department whose total labour hours are x0, see Example 2.4. In this section we derive a Prediction Interval (PI) for such a response

y0 = µ0 + ε0

for which the point prediction is \hat{y}_0 = \hat{\mu}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0.

By Theorem 2.6 we have

\hat{\mu}_0 \sim N(\mu_0, a\sigma^2), \quad \text{where } a = \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right).

Hence,

\hat{\mu}_0 - \mu_0 \sim N(0, a\sigma^2).

What is the distribution of \hat{\mu}_0 - y_0 = \hat{y}_0 - y_0? Adding and subtracting the error term we obtain

\underbrace{\hat{\mu}_0 - (\mu_0 + \varepsilon_0) + \varepsilon_0}_{:=Z} \sim N(0, a\sigma^2).

We know that \varepsilon_0 \sim N(0, \sigma^2). Hence,

\hat{y}_0 - y_0 = Z - \varepsilon_0 \sim N(0, a\sigma^2 + \sigma^2).

That is,

\hat{y}_0 - y_0 \sim N\left(0, \; \sigma^2\left\{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right\}\right)


and so

\frac{\hat{y}_0 - y_0}{\sqrt{\sigma^2\left\{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right\}}} \sim N(0, 1). \qquad (2.22)

Replacing σ² by its estimator S² gives

\frac{\hat{y}_0 - y_0}{\sqrt{S^2\left\{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right\}}} \sim t_{n-2}.

Hence, a (1 − α)100% PI for y0 is

\hat{y}_0 \pm t_{n-2,\alpha/2}\sqrt{S^2\left\{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right\}}.

This interval is usually much wider than the CI for the mean response µ0. This is because of the random error ε0 reflecting the random source of variability in the data. Again, we should only make predictions for values of x0 within the range of the data.

For the overheads data, at x0 = 1000 hours the predicted value is 27292, the 95% confidence interval is (26374, 28210) and the 95% prediction interval is (23645, 30939).
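In R both intervals are given by predict(); a sketch assuming the overheads data are held in a data frame called overheads with columns x and y (the data frame name is illustrative) and fit <- lm(y ~ x, data = overheads):

> new <- data.frame(x = 1000)
> predict(fit, new, interval = "confidence", level = 0.95)   # CI for the mean response
> predict(fit, new, interval = "prediction", level = 0.95)   # PI for a new observation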

2.7 Further Model Checking

2.7.1 Outliers and influential observations

An outlier, in the context of regression, is an observation whose standardised residual is large (in absolute value) compared with the rest of the data. Recall the definition of the standardised residual as

d_i = \frac{y_i - \hat{y}_i}{s\sqrt{1 - v_i}}.

An outlier will usually be apparent from any of the residual plots (see Figure 2.15). Note that we can have a single outlier together with a normal plot that suggests the data are light-tailed.

Figure 2.15: Example of an outlier in red.

Minitab prints a warning about observations with a standardised residual greater than 2.00 in absolute value. For most datasets such an observation is not really an outlier. The null distribution of the largest standardised residual is complicated, but using a 5% significance test the critical value for simple linear regression is given in the following table for different sample sizes n.

n         6     7     8     9     10    14    20    25    40    60
max |di|  1.93  2.08  2.20  2.29  2.37  2.58  2.77  2.88  3.08  3.23

Table 2.3: Critical values to declare an observation as an outlier at the 5% significance level

If we find an outlier we should check whether the observation was misrecorded or miscopied and, if so, correct it. If it seems correctly recorded we should rerun the analysis excluding the outlier. If the conclusions from the second analysis differ substantially from the first one we should report both.

As well as outliers in the y values, we sometimes have values of x which are different to the rest. This is probably more of a potential problem when we have more than one regressor variable (multiple regression), but we shall discuss this also in the context of simple linear regression. To detect an observation with an unusual x value we use the leverage. This is defined as

v_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}.

We have met it before in defining the standardised residual. Note that \sum_i v_i = 2, so on average an observation will have a leverage of 2/n. We shall regard an observation with a leverage greater than 4/n as having a large leverage and greater than 6/n as having a very large leverage.
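In R the leverages are returned by hatvalues(); a small sketch flagging the thresholds just mentioned for a fitted model object fit:

> v <- hatvalues(fit)
> which(v > 4 / length(v))   # large leverage
> which(v > 6 / length(v))   # very large leverage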

An observation with a large leverage is not a wrong observation (although if the leverage is very large it is probably worth checking that the x value has been recorded correctly). Rather it is a potentially influential observation, i.e. one whose omission would cause a big change in the parameter estimates (see Figure 2.16).

Figure 2.16: Example of a large leverage.

We can use a statistic, called Cook’s statistic Di, or Cook’s distance, to measure the influence of an observation. This can be defined in a number of ways. For a simple linear regression model consider omitting the ith observation (xi, yi) and refitting to get fitted values denoted by ŷ^{(i)}. We define Cook’s statistic for case i to be

D_i = \frac{1}{2S^2}\sum_{j=1}^{n}\left(\hat{y}^{(i)}_j - \hat{y}_j\right)^2.


It can be shown that

D_i = \frac{d_i^2}{2}\cdot\frac{v_i}{1 - v_i}.

This shows that Di depends on both the size of the standardised residual di and the leverage vi. So a large value of Di can occur due to a large di or a large vi, or both.

An approximate way to determine whether Di is surprisingly large is to check whether Di is larger than the 50th percentile of an F^2_{n-2} distribution. If so, it has a major influence on the fitted values. Even if the largest Di is not larger than this value, the corresponding observation could still be considered influential if it is a lot larger than the second largest.

It is not recommended that influential observations be removed, but they indicate that some doubt should be expressed about the conclusions, since without the influential observation the conclusions might be rather different.

2.7.2 Example - Gesell scores

During an investigation into cyanotic heart disease a comparison was made between the Gesell adaptive score (similar to an IQ test) and age at first word (measured in months) for 21 children. The data are plotted in Figure 2.17.

Figure 2.17: Plot of Gesell data

The R output is

> gesell <-read.csv("gesell.csv")


> x <- gesell[ ,1]
> y <- gesell[ ,2]
> print(x)
 [1] 15 26 10  9 15 20 18 11  8 20  7  9 10 11 11 10 12 42 17
[20] 11 10
> print(y)
 [1]  95  71  83  91 102  87  93 100 104  94 113  96  83  84
[15] 102 100 105  57 121  86 100
> plot(x, y, main = "Y versus X, gesell")
> mody <- lm(y ~ x)
> summary(mody)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-15.604  -8.731   1.396   4.523  30.285

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.8738     5.0678  21.681 7.31e-15 ***
x            -1.1270     0.3102  -3.633  0.00177 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.02 on 19 degrees of freedom
Multiple R-squared: 0.41,    Adjusted R-squared: 0.3789
F-statistic: 13.2 on 1 and 19 DF,  p-value: 0.001769

> anova(mody)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
x          1 1604.1  1604.1  13.202 0.001769 **
Residuals 19 2308.6   121.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> abline(109.87, -1.127, lty = 3)
> stdres1 <- rstandard(mody)
> fits1 <- fitted(mody)
> plot(x, stdres1, main = "Std res vs x, gesell")
> qqnorm(stdres1, main = "Q-Q Plot")
> qqline(stdres1)
> print(stdres1)
          1           2           3           4           5
 0.18883222 -0.94440639 -1.46226437 -0.82158155  0.83965939
          6           7           8           9          10
-0.03147039  0.31891861  0.23566531  0.29716139  0.62796572
         11          12          13          14          15
 1.04797524 -0.35108151 -1.46226437 -1.25882099  0.42247610
         16          17          18          19          20
 0.13082533  0.80601240 -0.85153932  2.82336807 -1.07201020
         21
 0.13082533

We plot the standardized residuals and the QQ plot to check linearity and normality. From Figure 2.18, we see that observation 19 has a very large standardised residual (2.82) and is an outlier according to Table 2.3.

Figure 2.18: Linearity (left) and normality checks using the standardized residuals and the Q-Q plot.

Moving to the Cook’s distance and the Leverage values, we have

> hat <- hatvalues(mody)
> cook <- cooks.distance(mody)
> i <- 1:21
> plot(i, hat, main = "Leverage values")
> plot(i, cook, main = "Cooks distance values")
> qf(0.50, 2, 19)
[1] 0.7190606

Figure 2.19 shows the results for the Cook’s distance and the leverage values.

We can see that observation 19 has a very large standardised residual and is an outlier according to the table presented in lectures.


Figure 2.19: Plot of leverage values (left) and Cook's distance (right) for the Gesell data.

Observation 18 has a very large leverage value as it exceeds 6/21. Observation 18 also has the largest Cook's distance. The 50th percentile point of an F^2_{19} distribution is 0.719. Observation 18 does not quite exceed this value, but its Cook's distance is considerably larger than the other values and it can be regarded as influential.

If we delete the outlier, observation 19, the estimates of β0 and β1 do not change a lot, although that for S does.

> x2 <- x[-19]
> y2 <- y[-19]
> print(x2)
 [1] 15 26 10  9 15 20 18 11  8 20  7  9 10 11 11 10 12 42 11
[20] 10
> mody2 <- lm(y2 ~ x2)
> summary(mody2)

Call:
lm(formula = y2 ~ x2)

Residuals:
    Min      1Q  Median      3Q     Max
-14.372  -7.350   2.628   5.337  12.049

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.3047     3.9700  27.533 3.64e-16 ***
x2           -1.1933     0.2435  -4.901 0.000115 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.628 on 18 degrees of freedom
Multiple R-squared: 0.5716,    Adjusted R-squared: 0.5478
F-statistic: 24.02 on 1 and 18 DF,  p-value: 0.0001151


> stdres3 <- rstandard(mody2)
> print(stdres3)
         1          2          3          4          5
 0.4275801 -0.9203938 -1.7220088 -0.9101149  1.2601460
         6          7          8          9         10
 0.1883101  0.6190081  0.4564682  0.5128596  1.0324596
        11         12         13         14         15
 1.4653314 -0.3085755 -1.7220088 -1.4545700  0.6953480
        16         17         18         19         20
 0.3149397  1.1934241 -0.4365044 -1.2156902  0.3149397

Figure 2.20: Top: linearity and normality checks. Bottom: leverage values and Cook's distance when observation 19 is deleted.

If in the original dataset we instead delete the observation with the large leverage and the highest Cook's distance, observation 18, the estimates of the regression parameters do change quite a bit. In fact the slope parameter β1 is no longer significant.

> x1 <- x[-18]
> y1 <- y[-18]
> print(x1)
 [1] 15 26 10  9 15 20 18 11  8 20  7  9 10 11 11 10 12 17 11
[20] 10
> mody1 <- lm(y1 ~ x1)
> summary(mody1)

Call:
lm(formula = y1 ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max
-14.838  -8.477   1.779   4.688  28.617

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 105.6299     7.1619  14.749 1.71e-11 ***
x1           -0.7792     0.5167  -1.508    0.149
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.11 on 18 degrees of freedom
Multiple R-squared:  0.1122,	Adjusted R-squared:  0.06284
F-statistic: 2.274 on 1 and 18 DF,  p-value: 0.1489

> stdres2 <- rstandard(mody1)
> print(stdres2)
          1           2           3           4           5
 0.09822136 -1.69275072 -1.38489076 -0.71679044  0.74780801
          6           7           8           9          10
-0.29847589  0.13280175  0.27297101  0.43793692  0.38757318
         11          12          13          14          15
 1.23646407 -0.24626304 -1.38489076 -1.21179848  0.45856720
         16          17          18          19          20
 0.20182434  0.80649481  2.69300552 -1.02620229  0.20182434

2.7.3 Transformation of the response

If the model checking suggests that the variance is not constant, or that the data are not from a normal distribution (these often happen together), then it might be possible to obtain a better model by transforming the observations yi. We have seen one example in Practical 2. Commonly used transformations are:

• ln y; this is particularly good if Var[Y_i] ∝ E[Y_i]^2.

• √y; this is particularly good if Var[Y_i] ∝ E[Y_i].

• sin^{-1}(√y).

• 1/y.


Figure 2.21: Top: linearity and normality checks. Bottom: leverage values and Cook's distance when observation 18 is deleted.

The arc-sine transformation can be used if the data are proportions, and the square root transformation when the data are counts.

In practice the log transformation is often the most useful and is generally the first transformation we try, but note that the data must be positive.
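As a quick illustration (not part of the original analysis), the transformations above can be tried directly inside lm(); a minimal sketch, assuming the response and regressor are stored in vectors y and x:

mod.log  <- lm(log(y) ~ x)     # log transformation; requires y > 0
mod.sqrt <- lm(sqrt(y) ~ x)    # square root, often used for counts
mod.inv  <- lm(I(1/y) ~ x)     # reciprocal; I() is needed inside a formula
plot(fitted(mod.log), rstandard(mod.log),
     main = "Std res vs fitted, log scale")   # re-check the residual plots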

2.7.4 Example - Plasma level of polyamine

The plasma level of polyamine Y was observed in 25 children of age X = 0 (newborn) to 4 years old. The results are given in Table 2.4 below.

We are interested in whether the level of polyamine decreases linearly with the age of the children.

The fitted line plot is given in Figure 2.22.

Figure 2.22: Fitted line plot for plasma data

The R output is:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.2400     0.7800  16.975 1.67e-14 ***
x            -2.1190     0.3184  -6.655 8.66e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.252 on 23 degrees of freedom
Multiple R-squared:  0.6582,	Adjusted R-squared:  0.6433
F-statistic: 44.28 on 1 and 23 DF,  p-value: 8.66e-07

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
x          1 224.51  224.51  44.283 8.66e-07 ***
Residuals 23 116.61    5.07

The p-value in the table of coefficients for the slope shows that the regression is highly significant. However, the dependence of the polyamine level in plasma on the age of the children is more likely to decrease exponentially to a small value than linearly, and there is some evidence for this in the fitted line plot.

We also see from the plot of residuals versus fitted values (see Figure 2.23) that the spread of the data is not constant, which casts doubt on the assumption of a constant variance. There is also some evidence of curvature in this plot. The normal plot shows some skewness in the distribution.

x = 0: 20.12 16.10 10.21 11.24 13.35
x = 1:  8.75  9.45 13.22 12.11 10.38
x = 2:  9.25  6.87  7.21  8.44  7.55
x = 3:  6.45  4.35  5.58  7.12  8.10
x = 4:  5.15  6.12  5.70  4.25  7.98

Table 2.4: Plasma levels data

Figure 2.23: Residual plots for plasma data

Taking a log transformation of the data, we obtain the fitted line plot shown in Figure 2.24.

Figure 2.24: Fitted line plot for logged plasma data

The fitted line does seem better. The R output is below. The regression is still highly significant, and the value of R² has increased, which is another indication of a better fit.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.58292    0.07249  35.629  < 2e-16 ***
x           -0.23045    0.02960  -7.787  6.8e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2093 on 23 degrees of freedom


Multiple R-squared:  0.725,	Adjusted R-squared:  0.713
F-statistic: 60.63 on 1 and 23 DF,  p-value: 6.799e-08

Analysis of Variance Table

Response: ly
          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 2.6554  2.6554  60.631 6.799e-08 ***
Residuals 23 1.0073  0.0438

The residual plots also give a better indication of normality, and the linearity in x seems better too.

2.7.5 Pure Error and Lack of Fit

We have seen that the residuals for the plasma data are not likely to be a sample from a normal distribution with a constant variance. One reason can be that the straight line is not a good choice of model. This fact can easily be seen here, but we can also check lack of fit more formally. This is possible when we have replications, that is, more than one observation for some values of the explanatory variable. Here we have five observations for each age xi.

Notation: Denote by $y_{ij}$ the $j$-th observation at $x_i$, $i = 1, \ldots, m$, $j = 1, \ldots, n_i$, so that the total number of observations is $n = \sum_{i=1}^{m} n_i$. The average value of the observations at $x_i$ is
$$\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}.$$
We denote the fitted value at $x_i$ by $\hat{y}_i$, which is the same for all observations at $x_i$.

The residuals $e_{ij}$ measure the differences between the observed values $y_{ij}$ and the fitted value $\hat{y}_i$, i.e.
$$e_{ij} = y_{ij} - \hat{y}_i.$$

These differences arise for two reasons. Firstly, $y_{ij}$ is an outcome of a random variable: even observations obtained for the same value of x produce different values of y. Secondly, the model we fit is not exactly true.

How can we distinguish between the random variation and the lack of fit? We need more than one observation at $x_i$ to be able to do so.

The difference between an observation $y_{ij}$ and the mean of the observations taken at the same value of x, here $x_i$, i.e.
$$y_{ij} - \bar{y}_i,$$
measures the random variation at $x_i$; it is called pure error. The difference between this mean and the fitted value, i.e. $\bar{y}_i - \hat{y}_i$, measures lack of fit at $x_i$.


Using the double index notation we may write the sum of squares for residuals as
$$SS_E = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \hat{y}_i)^2.$$
We can also define the pure error sum of squares as a measure of overall random variation:
$$SS_{PE} = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$$
and the lack of fit sum of squares as a measure of overall lack of fit:
$$SS_{LoF} = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(\bar{y}_i - \hat{y}_i)^2 = \sum_{i=1}^{m} n_i(\bar{y}_i - \hat{y}_i)^2.$$

Theorem 2.7. In the simple linear regression model we have

SSE = SSLoF + SSPE.

Proof.
$$\begin{aligned}
SS_E &= \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \hat{y}_i)^2 \\
&= \sum_{i=1}^{m}\sum_{j=1}^{n_i}\{(y_{ij} - \bar{y}_i) + (\bar{y}_i - \hat{y}_i)\}^2 \\
&= \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{m} n_i(\bar{y}_i - \hat{y}_i)^2 + 2\sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)(\bar{y}_i - \hat{y}_i) \\
&= SS_{PE} + SS_{LoF} + 2\sum_{i=1}^{m}(\bar{y}_i - \hat{y}_i)\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i) \\
&= SS_{PE} + SS_{LoF},
\end{aligned}$$
since $\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i) = 0$. □

This theorem shows how the residual sum of squares splits into two parts, one due to pure error and one due to the lack of fit of the model. To work out the split of the degrees of freedom, note that to calculate $SS_{PE}$ we must calculate $m$ sample means $\bar{y}_i$, $i = 1, \ldots, m$. Each sample mean takes up one degree of freedom, so the degrees of freedom for pure error are $n - m$. By subtraction, the degrees of freedom for lack of fit are
$$(n-2) - (n-m) = m - 2.$$
We will see later that
$$E[SS_{PE}] = (n-m)\sigma^2$$
whether the simple linear regression model is true or not.

It can also be shown that, if the simple linear regression model is true, then
$$E[SS_{LoF}] = (m-2)\sigma^2.$$
Hence both $MS_{PE}$ and $MS_{LoF}$ give unbiased estimators of $\sigma^2$, but the latter only if the model is true.

Let

H0: the simple linear regression model is true, and
H1: the simple linear regression model is not true.

Then under H0,
$$\frac{(m-2)MS_{LoF}}{\sigma^2} \sim \chi^2_{m-2}.$$
Also,
$$\frac{(n-m)MS_{PE}}{\sigma^2} \sim \chi^2_{n-m}$$
whatever the model.

Hence, under H0, the ratio of these two statistics, each divided by its respective degrees of freedom, is distributed as $F^{m-2}_{n-m}$, namely
$$F = \frac{MS_{LoF}}{MS_{PE}} \sim F^{m-2}_{n-m}.$$

These calculations can be set out in an Analysis of Variance table.

Source of variation   d.f.    SS       MS                        VR
Regression            1       SSR      MSR                       MSR/MSE
Residual              n − 2   SSE      MSE = SSE/(n − 2)
  Lack of fit         m − 2   SSLoF    MSLoF = SSLoF/(m − 2)     MSLoF/MSPE
  Pure Error          n − m   SSPE     MSPE = SSPE/(n − m)
Total                 n − 1   SST

Table 2.5: ANOVA Table

Note that we can only do this lack of fit test if we have replications. These have to be true replications, not just repeated measurements on the same sampling unit.

Example 2.5. (Plasma data continued)

To illustrate these ideas we return to the plasma example.

Within R we can get the pure error sum of squares by treating x as a factor rather than as a regressor variable. This is what we would do if our values of x represented different treatments but the values, 0 to 4 here, did not have any real meaning.

We use the commands


> mody3 <- lm(y ~ factor(x))
> anova(mody3)

Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
factor(x)  4 243.578  60.895  12.487 2.974e-05 ***
Residuals 20  97.536   4.877
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The line for Pure Error is just the Residuals line in this model. So the pure error sum of squares is 97.54 and we can write the overall ANOVA table as

Analysis of Variance

Source          DF       SS       MS      F      P
Regression       1   224.51   224.51  44.28  0.000
Residual Error  23   116.61     5.07
  Lack of Fit    3    19.07     6.36   1.30  0.301
  Pure Error    20    97.54     4.88
Total           24   341.11

In fact the p-value for the lack of fit test is 0.301, so there is no evidence against the null hypothesis that the straight-line model fits satisfactorily. We might think the residual plot shows some evidence that a transformation is necessary, but this test does not really support that!
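The same lack-of-fit F test can be obtained directly in R by comparing the straight-line fit with the factor model; a minimal sketch, where mod.lin is a hypothetical name for the straight-line fit of y on x:

mod.lin <- lm(y ~ x)            # straight-line model
mod.fac <- lm(y ~ factor(x))    # one mean per distinct x (the pure error model)
anova(mod.lin, mod.fac)         # extra SS = lack-of-fit SS on m - 2 df,
                                # compared with the pure error mean square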

The analysis of variance table for the plasma data after taking the log transformation is as follows:

> mody4 <- lm(ly ~ factor(x))
> anova(mody4)
Analysis of Variance Table

Response: ly
          Df  Sum Sq Mean Sq F value    Pr(>F)
factor(x)  4 2.74384 0.68596  14.931 8.382e-06 ***
Residuals 20 0.91883 0.04594

So in this case the full Anova table is

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       1   2.6554  2.6554  60.63  0.000
Residual        23   1.0073  0.0438
  Lack of Fit    3   0.0885  0.0295   0.64  0.597
  Pure Error    20   0.9188  0.0459
Total           24   3.6627

The p-value is 0.597, so there is again no reason to doubt the fit of this model. Overall I would prefer the transformed model, based on the residual plots.

This example emphasises that we should take account of all the information we have (residual plots, the analysis of variance table, tests about the individual parameters, outliers and influential observations, etc.) in deciding on our final model.

2.8 Matrix approach to simple linear regression

In this section I will discuss a matrix approach to fitting simple linear regression models. I will use various results which will be discussed more fully in the next chapter.

Any set of equations can be rewritten in matrix and vector form. For the simple linear regression model we have n equations:

$$\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_1 + \varepsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_2 + \varepsilon_2 \\
&\;\;\vdots \\
y_n &= \beta_0 + \beta_1 x_n + \varepsilon_n
\end{aligned}$$

We can write this in matrix formulation as

Y = Xβ + ε (2.23)

where Y is an (n × 1) vector of observations, X is an (n × 2) matrix called the design matrix, β is a (2 × 1) vector of unknown parameters and ε is an (n × 1) vector of errors. Here

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}.$$

This formulation (2.23) is usually called the General Linear Model.

2.8.1 Vectors of random variables

The vectors Y and ε in equation (2.23) are random vectors, as their elements are random variables. Below we show some properties of random vectors.


Definition 2.3. The expected value of a random vector is the vector of the respective expected values. That is, for a random vector $z = (z_1, \ldots, z_n)^T$ we write
$$E(z) = E\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} E(z_1) \\ E(z_2) \\ \vdots \\ E(z_n) \end{bmatrix} \qquad (2.24)$$

We have analogous properties of the expectation for random vectors as for single random variables. Namely, for a random vector $z$, a constant scalar $a$, a constant vector $b$ and for matrices of constants $A$ and $B$ we have
$$E(az + b) = a\,E(z) + b, \qquad E(Az) = A\,E(z), \qquad E(z^T B) = E(z)^T B. \qquad (2.25)$$

Variances and covariances of the random variables $z_i$ are put together to form the so-called variance-covariance (dispersion) matrix,
$$\operatorname{Var}(z) = \begin{bmatrix} \operatorname{var}(z_1) & \operatorname{cov}(z_1, z_2) & \cdots & \operatorname{cov}(z_1, z_n) \\ \operatorname{cov}(z_2, z_1) & \operatorname{var}(z_2) & \cdots & \operatorname{cov}(z_2, z_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}(z_n, z_1) & \operatorname{cov}(z_n, z_2) & \cdots & \operatorname{var}(z_n) \end{bmatrix} \qquad (2.26)$$

The dispersion matrix has the following properties.

(a) The matrix Var(z) is symmetric since cov(zi, zj) = cov(zj, zi).

(b) For mutually uncorrelated random variables the matrix is diagonal, since cov(z_i, z_j) = 0 for all i ≠ j.

(c) The var-cov matrix can be expressed as
$$\operatorname{Var}(z) = E[(z - E z)(z - E z)^T].$$

(d) The dispersion matrix of a transformed variable u = Az is
$$\operatorname{Var}(u) = A\operatorname{Var}(z)A^T.$$

Proof. Denote by $\mu = (\mu_1, \ldots, \mu_n)^T = (E z_1, \ldots, E z_n)^T$. To see (c), write
$$E[(z-\mu)(z-\mu)^T] = E\left[\begin{pmatrix} z_1-\mu_1 \\ \vdots \\ z_n-\mu_n \end{pmatrix}(z_1-\mu_1, \ldots, z_n-\mu_n)\right]
= \begin{bmatrix} E(z_1-\mu_1)^2 & \cdots & E[(z_1-\mu_1)(z_n-\mu_n)] \\ \vdots & \ddots & \vdots \\ E[(z_n-\mu_n)(z_1-\mu_1)] & \cdots & E(z_n-\mu_n)^2 \end{bmatrix} = \operatorname{Var}(z).$$


To show (d) we can use the notation of (c):
$$\begin{aligned}
\operatorname{Var}(u) &= E[(u - E u)(u - E u)^T] \\
&= E[(Az - A\mu)(Az - A\mu)^T] \\
&= E[A(z-\mu)(z-\mu)^T A^T] \\
&= A\,E[(z-\mu)(z-\mu)^T]\,A^T \\
&= A\operatorname{Var}(z)A^T.
\end{aligned}$$

Note that property (c) gives an expression for the dispersion matrix of a random vector analogous to the expression for the variance of a single random variable, that is,
$$\operatorname{Var}(z) = E(zz^T) - \mu\mu^T. \qquad (2.27)$$

2.8.2 Multivariate Normal Distribution

In Probability and Statistics II we defined the bivariate normal distribution. We can extend this idea to more than two random variables.

A random vector z has a multivariate normal distribution if its p.d.f. can be written as
$$f(z) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(V)}} \exp\left\{-\frac{1}{2}(z-\mu)^T V^{-1}(z-\mu)\right\}, \qquad (2.28)$$
where μ is the mean and V is the variance-covariance matrix of z.

We use the notation $z \sim N_n(\mu, V)$ for such a distribution.

2.8.3 Least squares estimation

For the general linear model (2.23) the normal equations are given by
$$X^T y = X^T X \beta.$$
It follows that, so long as $X^TX$ is invertible, i.e. its determinant is non-zero, the unique solution to the normal equations is given by
$$\hat{\beta} = (X^TX)^{-1}X^T y.$$

For the simple linear regression model
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix},$$
$$X^T y = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix},
\qquad
X^T X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}.$$
The determinant of $X^TX$ is given by
$$|X^TX| = n\sum x_i^2 - \Big(\sum x_i\Big)^2 = nS_{xx}.$$
Hence the inverse of $X^TX$ is
$$(X^TX)^{-1} = \frac{1}{nS_{xx}}\begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix} = \frac{1}{S_{xx}}\begin{bmatrix} (\sum x_i^2)/n & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix}.$$
So the solution to the normal equations is given by
$$\begin{aligned}
\hat{\beta} &= (X^TX)^{-1}X^Ty \\
&= \frac{1}{S_{xx}}\begin{bmatrix} (\sum x_i^2)/n & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix}\begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix} \\
&= \frac{1}{S_{xx}}\begin{bmatrix} (\sum x_i^2 \sum y_i)/n - \bar{x}\sum x_i y_i \\ \sum x_i y_i - \bar{x}\sum y_i \end{bmatrix} \\
&= \frac{1}{S_{xx}}\begin{bmatrix} \{(\sum x_i^2)/n - \bar{x}^2\}\sum y_i - \bar{x}(\sum x_i y_i - \bar{x}\sum y_i) \\ S_{xy} \end{bmatrix} \\
&= \frac{1}{S_{xx}}\begin{bmatrix} S_{xx}\bar{y} - \bar{x}S_{xy} \\ S_{xy} \end{bmatrix}
= \begin{bmatrix} \bar{y} - \hat{\beta}_1\bar{x} \\ \hat{\beta}_1 \end{bmatrix},
\end{aligned}$$
which is the same result as we obtained before.

The fitted values are
$$\hat{\mu}_i = x_i^T\hat{\beta} = (1 \;\; x_i)\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} = \hat{\beta}_0 + \hat{\beta}_1 x_i.$$


The residual sum of squares is
$$\begin{aligned}
SS_E &= y^Ty - \hat{\beta}^TX^Ty \\
&= \sum y_i^2 - (\hat{\beta}_0 \;\; \hat{\beta}_1)\begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} \\
&= \sum y_i^2 - (\bar{y} - \hat{\beta}_1\bar{x})\sum y_i - \hat{\beta}_1\sum x_i y_i \\
&= \sum y_i^2 - \bar{y}\sum y_i - \hat{\beta}_1\Big(\sum x_i y_i - \bar{x}\sum y_i\Big) \\
&= S_{yy} - \hat{\beta}_1 S_{xy} \\
&= S_{yy} - \frac{S_{xy}^2}{S_{xx}}.
\end{aligned}$$

Theorem 2.8. The least squares estimator $\hat{\beta}$ is an unbiased estimator of β.

Proof.
$$\begin{aligned}
E[\hat{\beta}] &= E[(X^TX)^{-1}X^TY] \\
&= (X^TX)^{-1}X^T E[Y] \\
&= (X^TX)^{-1}X^TX\beta \\
&= \beta.
\end{aligned}$$

Theorem 2.9. $\operatorname{Var}[\hat{\beta}] = \sigma^2(X^TX)^{-1}$.

Proof. We have $\hat{\beta} = AY$ where $A = (X^TX)^{-1}X^T$. Using the result for $\operatorname{var}(AY)$ we have
$$\begin{aligned}
\operatorname{Var}[\hat{\beta}] &= (X^TX)^{-1}X^T\operatorname{var}(Y)X(X^TX)^{-1} \\
&= \sigma^2(X^TX)^{-1}X^T I X(X^TX)^{-1} \\
&= \sigma^2(X^TX)^{-1}.
\end{aligned}$$

An alternative proof is as follows. First note that $\operatorname{Var}[Y] = E[YY^T] - E[Y]E[Y^T]$, and hence
$$E[YY^T] = \operatorname{Var}[Y] + E[Y]E[Y^T] = \sigma^2 I + X\beta\beta^TX^T.$$


Now
$$\begin{aligned}
\operatorname{Var}[\hat{\beta}] &= E[\hat{\beta}\hat{\beta}^T] - E[\hat{\beta}]E[\hat{\beta}^T] \\
&= E[(X^TX)^{-1}X^TYY^TX(X^TX)^{-1}] - \beta\beta^T \\
&= (X^TX)^{-1}X^T E[YY^T] X(X^TX)^{-1} - \beta\beta^T \\
&= (X^TX)^{-1}X^T(\sigma^2 I + X\beta\beta^TX^T)X(X^TX)^{-1} - \beta\beta^T \\
&= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1} + (X^TX)^{-1}X^TX\beta\beta^TX^TX(X^TX)^{-1} - \beta\beta^T \\
&= \sigma^2(X^TX)^{-1} + \beta\beta^T - \beta\beta^T \\
&= \sigma^2(X^TX)^{-1}.
\end{aligned}$$

Theorem 2.10. If
$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I),$$
then
$$\hat{\beta} \sim N_p(\beta, \sigma^2(X^TX)^{-1}).$$

Proof. Each element of $\hat{\beta}$ is a linear function of $Y_1, \ldots, Y_n$. We assume that the $Y_i$, $i = 1, \ldots, n$, are normally distributed, hence $\hat{\beta}$ is also normally distributed. We obtained its mean and variance in the previous two theorems. □

Remark 2.7. The vector of fitted values is given by
$$\hat{\mu} = \hat{Y} = X\hat{\beta} = X(X^TX)^{-1}X^TY = HY.$$
The matrix $H = X(X^TX)^{-1}X^T$ is called the hat matrix.

Note that $H^T = H$ and also
$$HH = X(X^TX)^{-1}\underbrace{X^TX(X^TX)^{-1}}_{=I}X^T = X(X^TX)^{-1}X^T = H.$$
A matrix satisfying the condition $AA = A$ is called an idempotent matrix.
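A small numerical check of these properties can be done in R; this sketch uses an arbitrary toy design matrix, not data from the notes:

X <- cbind(1, c(1.2, 2.5, 3.1, 4.8, 6.0))   # toy design matrix: intercept and one x
H <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix H = X (X'X)^{-1} X'
all.equal(H, t(H))       # symmetric
all.equal(H, H %*% H)    # idempotent
sum(diag(H))             # trace equals the number of parameters, here 2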

We now prove some results about the residual vector
$$e = Y - \hat{Y} = Y - HY = (I - H)Y.$$


Lemma 2.2. $E(e) = 0$.

Proof.
$$\begin{aligned}
E(e) &= (I - H)E(Y) \\
&= (I - X(X^TX)^{-1}X^T)X\beta \\
&= X\beta - X\beta = 0.
\end{aligned}$$

Lemma 2.3. $\operatorname{var}(e) = \sigma^2(I - H)$.

Proof.
$$\begin{aligned}
\operatorname{var}(e) &= (I - H)\operatorname{var}(Y)(I - H)^T \\
&= \sigma^2(I - H)^2 \\
&= \sigma^2(I - H - H + H^2) \\
&= \sigma^2(I - H).
\end{aligned}$$

Lemma 2.4. The sum of squares of the residuals is $Y^T(I - H)Y$.

Proof.
$$\begin{aligned}
e^Te &= Y^T(I - H)^T(I - H)Y \\
&= Y^T(I - H - H + HH)Y \\
&= Y^T(I - H)Y.
\end{aligned}$$

Lemma 2.5. The elements of the residual vector e sum to zero, i.e.
$$\sum_{i=1}^{n} e_i = 0.$$
Note we saw this in equation 2.9.

Proof. We will prove this by contradiction. Assume that $\sum e_i = nc$ where $c \neq 0$. Then
$$\begin{aligned}
\sum e_i^2 &= \sum\{(e_i - c) + c\}^2 \\
&= \sum(e_i - c)^2 + 2c\sum(e_i - c) + nc^2 \\
&= \sum(e_i - c)^2 + 2c\Big(\underbrace{\textstyle\sum e_i}_{=nc} - nc\Big) + nc^2 \\
&= \sum(e_i - c)^2 + nc^2 \\
&> \sum(e_i - c)^2.
\end{aligned}$$
But we know that $\sum e_i^2$ is the minimum value of $S(\beta)$, so there cannot exist values with a smaller sum of squares, and this gives the required contradiction. So $c = 0$. □

Corollary 2.3.1.
$$\frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i = \bar{Y}.$$

Proof. The residual $e_i = Y_i - \hat{Y}_i$, so $\sum e_i = \sum(Y_i - \hat{Y}_i)$, but $\sum e_i = 0$. Hence $\sum \hat{Y}_i = \sum Y_i$ and the result follows. □

We can check that the theorems above give the same results as we found for simple linear regression. We have seen above that
$$(X^TX)^{-1} = \frac{1}{nS_{xx}}\begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}.$$
Now, by Theorem 2.9, $\operatorname{var}[\hat{\beta}] = \sigma^2(X^TX)^{-1}$. Thus
$$\operatorname{var}[\hat{\beta}_0] = \frac{\sigma^2\sum x_i^2}{nS_{xx}},$$
which, by writing $\sum x_i^2 = \sum x_i^2 - n\bar{x}^2 + n\bar{x}^2$, can be written as
$$\sigma^2\left\{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right\}$$
as before.

Also
$$\operatorname{cov}[\hat{\beta}_0, \hat{\beta}_1] = \sigma^2\left[\frac{-\sum x_i}{nS_{xx}}\right] = \frac{-\sigma^2\bar{x}}{S_{xx}},$$
and
$$\operatorname{var}[\hat{\beta}_1] = \frac{\sigma^2}{S_{xx}}.$$

The quantity $v_j$ is given by
$$v_j = x_j^T(X^TX)^{-1}x_j = (1, x_j)\,\frac{1}{nS_{xx}}\begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}\begin{pmatrix} 1 \\ x_j \end{pmatrix}.$$
We leave it as an exercise to show that this simplifies to
$$v_j = \frac{1}{n} + \frac{(x_j - \bar{x})^2}{S_{xx}}.$$

The centred simple linear regression model
$$Y_i = \alpha + \beta(x_i - \bar{x}) + \varepsilon_i, \qquad i = 1, \ldots, n,$$
can be written as a General Linear Model with
$$X = \begin{bmatrix} 1 & (x_1 - \bar{x}) \\ 1 & (x_2 - \bar{x}) \\ \vdots & \vdots \\ 1 & (x_n - \bar{x}) \end{bmatrix}, \qquad \beta = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$$
Using the results from this section it follows that
$$\hat{\alpha} = \bar{y}, \qquad \hat{\beta} = \frac{S_{xy}}{S_{xx}},$$
and
$$\operatorname{var}[\hat{\alpha}] = \frac{\sigma^2}{n}, \qquad \operatorname{cov}[\hat{\alpha}, \hat{\beta}] = 0, \qquad \operatorname{var}[\hat{\beta}] = \frac{\sigma^2}{S_{xx}}.$$
The fact that $\hat{\alpha}$ and $\hat{\beta}$ are uncorrelated can make this a useful model to use.
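A quick way to see this in R is to fit the centred model directly; a sketch, assuming numeric vectors x and y:

mod.c <- lm(y ~ I(x - mean(x)))   # centred simple linear regression
coef(mod.c)    # intercept estimate equals mean(y); slope equals Sxy/Sxx
vcov(mod.c)    # the off-diagonal entry (covariance of the two estimators) is essentially zero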

2.9 Maximum likelihood estimation

2.9.1 Finding maximum likelihood estimates

Suppose we have a random sample of continuous observations $Y_1, Y_2, \ldots, Y_n$ from a distribution with density $f(y|\theta)$, where θ is an unknown parameter. Remember that a random sample has independent observations, all with the same distribution. We want to estimate θ. We define the likelihood function $L(\theta, y)$, where $y = (y_1, y_2, \ldots, y_n)$, by
$$L(\theta, y) = \prod_{i=1}^{n} f(y_i|\theta).$$

The likelihood is the probability (for continuous data, the density) of obtaining the data given the value of θ. The likelihood is regarded as a function of θ.

If the observations are discrete, with probability function $\Pr(Y = y|\theta)$, then we define the likelihood function by
$$L(\theta, y) = \prod_{i=1}^{n} \Pr(Y_i = y_i|\theta).$$

The maximum likelihood estimator is the value of θ, say $\hat{\theta}$, that makes $L(\theta, y)$ a maximum.

Example 2.6. As an example, suppose $Y_1, Y_2, \ldots, Y_n$ is a random sample from an exponential distribution with density function
$$f(y|\theta) = \theta\exp(-\theta y).$$
Then the likelihood is
$$L(\theta, y) = \prod f(y_i|\theta) = \theta^n\exp\Big(-\theta\sum y_i\Big).$$


To find the maximum likelihood estimator (MLE) of θ it is usually more convenient to find the maximum of $\log_e L$. This gives the same answer because the logarithm function is order preserving. In this case
$$\log L = n\log\theta - \theta\sum y_i.$$
Differentiating and setting equal to zero,
$$\frac{d\log L}{d\theta} = \frac{n}{\theta} - \sum y_i = 0,$$
which implies that
$$\hat{\theta} = \frac{n}{\sum y_i}.$$
If we differentiate again,
$$\frac{d^2\log L}{d\theta^2} = -\frac{n}{\theta^2} < 0,$$
so we do have a maximum.

The equation
$$\frac{d\log L}{d\theta} = 0$$
is called the likelihood equation, with solution $\theta = \hat{\theta}$.
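The algebra can be checked numerically by minimising the negative log-likelihood; a sketch with simulated exponential data (not data from the notes):

set.seed(1)
y.sim <- rexp(50, rate = 2)     # toy sample
negloglik <- function(theta) -(length(y.sim) * log(theta) - theta * sum(y.sim))
optimize(negloglik, interval = c(0.01, 10))$minimum   # close to length(y.sim)/sum(y.sim)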

Example 2.7. As another example, consider a random sample of size n from a Poisson distribution with mean θ. We know that
$$\Pr(Y = y|\theta) = \frac{e^{-\theta}\theta^y}{y!},$$
so that the likelihood function is
$$L(\theta, y) = \frac{e^{-n\theta}\theta^{\sum y_i}}{\prod y_i!}.$$
Thus
$$\log L = \sum y_i\log\theta - n\theta - \log\prod y_i!.$$
The likelihood equation is
$$\frac{\sum y_i}{\theta} - n = 0$$
with solution
$$\hat{\theta} = \frac{\sum y_i}{n} = \bar{y}.$$

Note the second derivative is indeed negative.


2.9.2 Properties of maximum likelihood estimators

Maximum likelihood estimators have good asymptotic properties, i.e. as n→∞.

• Asymptotically they are unbiased,

• they are normally distributed

• they achieve the minimum variance possible.

However, for small samples they can be biased, and there is no guarantee that the good asymptotic properties hold even approximately for small samples.

2.9.3 Relationship with least squares estimates

Consider a random sample of size n from a normal distribution with mean θ and variance σ². We wish to find the MLE of θ.

$$f(y|\theta) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^2}(y-\theta)^2\right)$$
$$L(\theta, y) = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\theta)^2\right)$$
$$\log L = -\log\big(\sigma^n(2\pi)^{n/2}\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\theta)^2$$
$$\frac{d\log L}{d\theta} = \frac{\sum(y_i-\theta)}{\sigma^2}$$
Setting this equal to zero and solving, we see that $\hat{\theta} = \bar{y}$. We see from this that to find the MLE we had to maximise
$$-\log\big(\sigma^n(2\pi)^{n/2}\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\theta)^2,$$
or equivalently to minimise $\sum(y_i-\theta)^2$.

Suppose now we have a simple linear regression
$$Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2).$$
The likelihood as a function of $\beta_0$ and $\beta_1$ is
$$L(\beta_0, \beta_1, y) = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\{y_i - (\beta_0 + \beta_1 x_i)\}^2\right).$$
To maximise $\log L$ with respect to $\beta_0$ and $\beta_1$ we have to maximise
$$-\sum_{i=1}^{n}\{y_i - (\beta_0 + \beta_1 x_i)\}^2,$$
or equivalently minimise
$$\sum_{i=1}^{n}\{y_i - (\beta_0 + \beta_1 x_i)\}^2.$$

But this is exactly the minimisation we carried out to find the least squares estimates. So if our observations are normally distributed, and we make the usual assumptions that they are independent with a common variance, then the least squares estimates are the same as the maximum likelihood estimates. This holds for all the linear models we look at, not just simple linear regression.

2.10 Exercise

Here we give an exercise for a simple linear regression model. Try to fill in all the gaps in the output; they are denoted by -----.

Measurements of the annual cost of upkeep of an American house in dollars (y) and the value of the house in thousands of dollars (x) were collected for a sample of 20 houses, and a simple linear regression model was fitted.

The regression equation is
y = -174 + 7.15 x

Predictor     Coef  SE Coef      T      P
Constant   -174.50    74.00  -2.36  0.030
x           7.1494   0.7623    ---  0.000

S = ---   R-Sq = ---%   R-Sq(adj) = 82.1%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  591102  ------  -----  0.000
Residual Error  18  ------    6720
Total           --  712065

Unusual Observations
Obs    x      y    Fit  SE Fit  Residual  St Resid
 12  116  813.6  -----    24.6     162.3     2.08R
R denotes an observation with a large standardized residual.


3 Multiple Regression

3.1 Example - Dwine data

The Dwine Studios Inc. operates studios in 21 medium-sized cities. These studios specialize in portraits of children. The company is considering an expansion into other similar cities and wishes to investigate whether sales (y) at a studio can be predicted from the number of persons aged 16 or younger (x1) and the per capita disposable personal income (x2).

Initially we investigate the relationship between the sales and the number of persons aged 16 or younger, that is, between y and x1.

> dwine <- read.csv("Dwine.csv")
> # define the variables of interest
> x1 <- dwine[,2]
> x2 <- dwine[,3]
> y <- dwine[,4]
> # define the first regression
> mod1 <- lm(y ~ x1)
> summary(mod1)

Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max
-19.403  -6.121  -0.311   4.228  27.452

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  68.0454     9.4622   7.191 7.86e-07 ***
x1            1.8359     0.1464  12.539 1.23e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.19 on 19 degrees of freedom
Multiple R-squared:  0.8922,	Adjusted R-squared:  0.8865
F-statistic: 157.2 on 1 and 19 DF,  p-value: 1.229e-10

> # plot the fitted model
> plot(x1, y, main = "Fitted line model 1")
> abline(mod1)

> anova(mod1)
Analysis of Variance Table


Figure 3.1: Fitted line for the regression model between y and x1.

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x1         1 23371.8 23371.8  157.22 1.229e-10 ***
Residuals 19  2824.4   148.7
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> stdres1 <- rstandard(mod1)
> fits1 <- fitted(mod1)
> plot(fits1, stdres1, main = "Std res vs fit, Model 1")

Figure 3.2: Standardized residuals versus fitted values for the regression model between y and x1.

> qqnorm(stdres1, main = "Q-Q Plot")
> qqline(stdres1)

> shapiro.test(stdres1)

        Shapiro-Wilk normality test


Figure 3.3: Q-Q plot of the standardized residuals for the regression model between y and x1.

data:  stdres1
W = 0.96526, p-value = 0.6276

In conclusion, for the first regression model we have a good fit (R² = 89.2%), the slope of the line is highly significant, and there are no major problems shown by the residual plots.

Next, we look at the relationship between the sales and the per capita disposable personal income, that is, between y and x2.

> mod2 <- lm(y ~ x2)
> summary(mod2)

Call:
lm(formula = y ~ x2)

Residuals:
    Min      1Q  Median      3Q     Max
-41.290 -11.630  -5.203  10.531  42.079

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -352.493     80.657  -4.370 0.000329 ***
x2            31.173      4.698   6.636 2.39e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.39 on 19 degrees of freedom
Multiple R-squared:  0.6986,	Adjusted R-squared:  0.6827
F-statistic: 44.03 on 1 and 19 DF,  p-value: 2.391e-06

> plot(x2, y, main = "Fitted line model 2")
> abline(mod2)


Figure 3.4: Fitted line for the regression model between y and x2.

> anova(mod2)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x2         1 18299.8 18299.8  44.032 2.391e-06 ***
Residuals 19  7896.4   415.6
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> stdres2 <- rstandard(mod2)
> fits2 <- fitted(mod2)
> plot(fits2, stdres2, main = "Std res vs fit, Model 2")

Figure 3.5: Standardized residuals versus fitted values for the regression model between y and x2.

> qqnorm(stdres2, main = "Q-Q Plot")
> qqline(stdres2)

> shapiro.test(stdres2)

        Shapiro-Wilk normality test


Figure 3.6: Q-Q plot of the standardized residuals for the regression model between y and x2.

data:  stdres2
W = 0.97719, p-value = 0.8802

For this model the dependence of sales on income is highly significant, although the R² value is not as high as for the first regression model. The residual plots show no major problems.

Finally, we take account of the effect of both the number of children and the income on the sales. Thus we run a regression model of y on x1 and x2 jointly.

> mod3 <- lm(y ~ x1 + x2)
> summary(mod3)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max
-18.4239  -6.2161   0.7449   9.4356  20.2151

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -68.8571    60.0170  -1.147   0.2663
x1            1.4546     0.2118   6.868    2e-06 ***
x2            9.3655     4.0640   2.305   0.0333 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.01 on 18 degrees of freedom
Multiple R-squared:  0.9167,	Adjusted R-squared:  0.9075
F-statistic:  99.1 on 2 and 18 DF,  p-value: 1.921e-10

> anova(mod3)
Analysis of Variance Table


Response: y
          Df  Sum Sq Mean Sq  F value   Pr(>F)
x1         1 23371.8 23371.8 192.8962 4.64e-11 ***
x2         1   643.5   643.5   5.3108  0.03332 *
Residuals 18  2180.9   121.2
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> stdres3 <- rstandard(mod3)
> fits3 <- fitted(mod3)
> plot(fits3, stdres3, main = "Std res vs fit, Model 3")

We see that the overall R² = 91.7% is not very much higher than for x1 alone. The coefficient for x2 is significant at the 5% level but not highly significant. It seems that adding x2 to the model makes only a modest improvement.

Figure 3.7: Standardized residuals versus fitted values for the regression model of y on x1 and x2.

> qqnorm(stdres3, main = "Q-Q Plot")
> qqline(stdres3)

Figure 3.8: Q-Q plot of the standardized residuals for the regression model of y on x1 and x2.

> shapiro.test(stdres3)


Shapiro-Wilk normality test

data:  stdres3
W = 0.95408, p-value = 0.4057

To see why the improvement is only modest, we plot x1 against x2 and see that they are related: small values of each tend to go together, as do large values. Thus, because they are correlated, adding x2 has little extra effect on the model.

Figure 3.9: Plot of x1 against x2.
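The strength of this relationship can also be checked numerically; a one-line sketch using the variables above:

> cor(x1, x2)   # a correlation well away from 0 confirms the two regressors are related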

3.2 Multiple Linear Regression Model

A fitted linear regression model always leaves some residual variation. There might be another systematic cause for the variability in the observations yi. If we have data on other explanatory variables, we can ask whether they can be used to explain some of the residual variation in Y. If this is the case, we should take it into account in the model, so that the errors are purely random. We could write
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \underbrace{\varepsilon_i^{\star}}_{\text{previously } \varepsilon_i}.$$

Z is another explanatory variable. Usually we denote all explanatory variables (there may be more than two of them) using the letter X with an index to distinguish between them, i.e. $X_1, X_2, \ldots, X_{p-1}$.

A Multiple Linear Regression (MLR) model for a response variable Y and explanatory variables $X_1, X_2, \ldots, X_{p-1}$ is
$$\begin{aligned}
E(Y_i|X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1}x_{p-1,i} \\
\operatorname{var}(Y_i|X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \sigma^2, \quad i = 1, \ldots, n \\
\operatorname{cov}(Y_i|X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i},\; Y_j|X_1 = x_{1j}, \ldots, X_{p-1} = x_{p-1,j}) &= 0, \quad i \neq j
\end{aligned}$$
As in the SLR model we denote
$$y_i = Y_i|X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}$$
and we usually omit the condition on the Xs and write
$$\begin{aligned}
\mu_i = E(y_i) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1}x_{p-1,i} \\
\operatorname{var}(y_i) &= \sigma^2, \quad i = 1, \ldots, n \\
\operatorname{cov}(y_i, y_j) &= 0, \quad i \neq j
\end{aligned}$$
or
$$\begin{aligned}
y_i &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1}x_{p-1,i} + \varepsilon_i \\
\operatorname{var}(\varepsilon_i) &= \sigma^2, \quad i = 1, \ldots, n \\
\operatorname{cov}(\varepsilon_i, \varepsilon_j) &= 0, \quad i \neq j
\end{aligned}$$

For testing we need the assumption of normality, i.e. we assume that
$$y_i \overset{\text{ind}}{\sim} N(\mu_i, \sigma^2)$$
or
$$\varepsilon_i \overset{\text{ind}}{\sim} N(0, \sigma^2).$$

To simplify the notation we write the MLR model in matrix form as follows:
$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{=:\,y} = \underbrace{\begin{bmatrix} 1 & x_{11} & \cdots & x_{p-1,1} \\ 1 & x_{12} & \cdots & x_{p-1,2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{p-1,n} \end{bmatrix}}_{=:\,X} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}}_{=:\,\beta} + \underbrace{\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}}_{=:\,\varepsilon}$$
So, using the matrix notation, we may write the model as
$$y = X\beta + \varepsilon, \qquad (3.1)$$
where y is the vector of responses, X is the design matrix, β is the vector of unknown, constant parameters and ε is the vector of random errors.
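The design matrix X used by lm() can be inspected directly, which makes the link between (3.1) and the R output concrete; a sketch, assuming the Dwine fit mod3 <- lm(y ~ x1 + x2) from Section 3.1:

X <- model.matrix(mod3)                     # n x p matrix: a column of 1s, then x1 and x2
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y, the same values as coef(mod3)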

3.3 Least squares estimation

We gave most of the results in this section in Chapter 2 for simple linear regression. We repeat them here to emphasise that they also hold for multiple regression.

To derive the least squares estimator (LSE) of the parameter vector β we minimise the sum of squares of the errors, that is,
$$\begin{aligned}
S(\beta) &= \sum_{i=1}^{n}[Y_i - \{\beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1}x_{p-1,i}\}]^2 \\
&= \sum \varepsilon_i^2 = \varepsilon^T\varepsilon \\
&= (Y - X\beta)^T(Y - X\beta) \\
&= (Y^T - \beta^TX^T)(Y - X\beta) \\
&= Y^TY - Y^TX\beta - \beta^TX^TY + \beta^TX^TX\beta \\
&= Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta.
\end{aligned}$$


Theorem 3.1. The LSE $\hat{\beta}$ of β is given by
$$\hat{\beta} = (X^TX)^{-1}X^TY$$
if $X^TX$ is non-singular. If $X^TX$ is singular there is no unique LSE of β.

Proof. Let $\beta_0$ be any solution of $X^TX\beta = X^TY$. Then
$$\begin{aligned}
S(\beta) - S(\beta_0) &= Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta - Y^TY + 2\beta_0^TX^TY - \beta_0^TX^TX\beta_0 \\
&= \beta^TX^TX\beta - 2\beta^TX^TX\beta_0 + \beta_0^TX^TX\beta_0 \\
&= \beta^TX^TX\beta - \beta^TX^TX\beta_0 - \beta_0^TX^TX\beta + \beta_0^TX^TX\beta_0 \\
&= \beta^T(X^TX\beta - X^TX\beta_0) - \beta_0^T(X^TX\beta - X^TX\beta_0) \\
&= (\beta^T - \beta_0^T)(X^TX\beta - X^TX\beta_0) \\
&= (\beta - \beta_0)^TX^TX(\beta - \beta_0) \\
&= \{X(\beta - \beta_0)\}^T\{X(\beta - \beta_0)\} \ge 0,
\end{aligned}$$
since it is a sum of squares of the elements of the vector $X(\beta - \beta_0)$.

So we have shown that $S(\beta) - S(\beta_0) \ge 0$.

Hence $\beta_0$ minimises $S(\beta)$, i.e. any solution of $X^TX\beta = X^TY$ minimises $S(\beta)$.

If $X^TX$ is non-singular the unique solution is $\hat{\beta} = (X^TX)^{-1}X^TY$.

If $X^TX$ is singular there is no unique solution. □

We proved the following three theorems in Chapter 2; they give the properties of the LSE $\hat{\beta}$ of β.

Theorem 3.2. The LSE $\hat{\beta}$ is an unbiased estimator of β.

Theorem 3.3. $\operatorname{Var}[\hat{\beta}] = \sigma^2(X^TX)^{-1}$.

Proof. An alternative proof to the one given before is as follows. First note that $\operatorname{Var}[Y] = E[YY^T] - E[Y]E[Y^T]$, and hence
$$E[YY^T] = \operatorname{Var}[Y] + E[Y]E[Y^T] = \sigma^2 I + X\beta\beta^TX^T.$$
Now
$$\begin{aligned}
\operatorname{Var}[\hat{\beta}] &= E[\hat{\beta}\hat{\beta}^T] - E[\hat{\beta}]E[\hat{\beta}^T] \\
&= E[(X^TX)^{-1}X^TYY^TX(X^TX)^{-1}] - \beta\beta^T \\
&= (X^TX)^{-1}X^T E[YY^T] X(X^TX)^{-1} - \beta\beta^T \\
&= (X^TX)^{-1}X^T(\sigma^2 I + X\beta\beta^TX^T)X(X^TX)^{-1} - \beta\beta^T \\
&= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1} + (X^TX)^{-1}X^TX\beta\beta^TX^TX(X^TX)^{-1} - \beta\beta^T \\
&= \sigma^2(X^TX)^{-1} + \beta\beta^T - \beta\beta^T \\
&= \sigma^2(X^TX)^{-1}.
\end{aligned}$$

Theorem 3.4. If
$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I),$$
then
$$\hat{\beta} \sim N_p(\beta, \sigma^2(X^TX)^{-1}).$$

Proof. Each element of $\hat{\beta}$ is a linear function of $Y_1, \ldots, Y_n$. We assume that the $Y_i$, $i = 1, \ldots, n$, are normally distributed, hence $\hat{\beta}$ is also normally distributed. We obtained its mean and variance in the previous two theorems. □

Remark 3.1. The vector of fitted values is given by
$$\hat{\mu} = \hat{Y} = X\hat{\beta} = X(X^TX)^{-1}X^TY = HY.$$
The matrix $H = X(X^TX)^{-1}X^T$ is called the hat matrix.

Recall that $H^T = H$ and also
$$HH = X(X^TX)^{-1}\underbrace{X^TX(X^TX)^{-1}}_{=I}X^T = X(X^TX)^{-1}X^T = H.$$
A matrix satisfying the condition $AA = A$ is called an idempotent matrix.

We now prove some results about the residual vector
$$e = Y - \hat{Y} = Y - HY = (I - H)Y.$$


Lemma 3.1. $E(e) = 0$.

Proof.
$$\begin{aligned}
E(e) &= (I - H)E(Y) \\
&= (I - X(X^TX)^{-1}X^T)X\beta \\
&= X\beta - X\beta = 0.
\end{aligned}$$

Lemma 3.2. $\operatorname{var}(e) = \sigma^2(I - H)$.

Proof.
$$\begin{aligned}
\operatorname{var}(e) &= (I - H)\operatorname{var}(Y)(I - H)^T \\
&= \sigma^2(I - H)^2 \\
&= \sigma^2(I - H - H + H^2) \\
&= \sigma^2(I - H).
\end{aligned}$$

Lemma 3.3. The sum of squares of the residuals is $Y^T(I - H)Y$.

Proof.
$$\begin{aligned}
e^Te &= Y^T(I - H)^T(I - H)Y \\
&= Y^T(I - H - H + HH)Y \\
&= Y^T(I - H)Y.
\end{aligned}$$

Lemma 3.4. The elements of the residual vector e sum to zero, i.e.
$$\sum_{i=1}^{n} e_i = 0.$$

Proof. We will prove this by contradiction. Assume that $\sum e_i = nc$ where $c \neq 0$. Then
$$\begin{aligned}
\sum e_i^2 &= \sum\{(e_i - c) + c\}^2 \\
&= \sum(e_i - c)^2 + 2c\sum(e_i - c) + nc^2 \\
&= \sum(e_i - c)^2 + 2c\Big(\underbrace{\textstyle\sum e_i}_{=nc} - nc\Big) + nc^2 \\
&= \sum(e_i - c)^2 + nc^2 \\
&> \sum(e_i - c)^2.
\end{aligned}$$
But we know that $\sum e_i^2$ is the minimum value of $S(\beta)$, so there cannot exist values with a smaller sum of squares, and this gives the required contradiction. So $c = 0$. □

Corollary 3.1.1.
$$\frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i = \bar{Y}.$$

Proof. The residual $e_i = Y_i - \hat{Y}_i$, so $\sum e_i = \sum(Y_i - \hat{Y}_i)$, but $\sum e_i = 0$. Hence $\sum \hat{Y}_i = \sum Y_i$ and the result follows. □

3.4 Analysis of Variance

We begin this section by proving the basic Analysis of Variance identity.

Theorem 3.5. The total sum of squares splits into the regression sum of squares and the residual sum of squares, that is,
$$SS_T = SS_R + SS_E.$$

Proof.
$$SS_T = \sum(Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = Y^TY - n\bar{Y}^2.$$
$$\begin{aligned}
SS_R &= \sum(\hat{Y}_i - \bar{Y})^2 \\
&= \sum \hat{Y}_i^2 - 2\bar{Y}\underbrace{\textstyle\sum \hat{Y}_i}_{=n\bar{Y}} + n\bar{Y}^2 \\
&= \sum \hat{Y}_i^2 - n\bar{Y}^2 \\
&= \hat{Y}^T\hat{Y} - n\bar{Y}^2 \\
&= \hat{\beta}^TX^TX\hat{\beta} - n\bar{Y}^2 \\
&= Y^TX(X^TX)^{-1}\underbrace{X^TX(X^TX)^{-1}}_{=I}X^TY - n\bar{Y}^2 \\
&= Y^THY - n\bar{Y}^2.
\end{aligned}$$
Note that $\hat{\beta}^TX^TX\hat{\beta}$ could also be expressed as $\hat{\beta}^TX^TY$ using the normal equations, so another way of writing $SS_R$ is $\hat{\beta}^TX^TY - n\bar{Y}^2$.

We have seen that
$$SS_E = Y^T(I - H)Y,$$
and so
$$SS_R + SS_E = Y^THY - n\bar{Y}^2 + Y^T(I - H)Y = Y^TY - n\bar{Y}^2 = SS_T.$$

3.4.1 F-test for the Overall Significance of Regression

Suppose we wish to test the hypothesis

H0 : β1 = β2 = . . . = βp−1 = 0

(i.e. all coefficients except β0 are zero) versus

H1 : at least one of the coefficients is non-zero.

Under H0 the model reduces to the null model
$$Y = \mathbf{1}\beta_0 + \varepsilon,$$
i.e. in testing H0 we are asking whether there is sufficient evidence to reject the null model.

Consider the statistic $F^\star$ defined by
$$F^\star = \frac{(\text{regression SS})/(p-1)}{s^2},$$
where $s^2 = S_{\min}/(n-p)$, so that
$$F^\star = \frac{\text{unbiased estimate of }\sigma^2 \text{ (if } H_0 \text{ is true)}}{\text{unbiased estimate of }\sigma^2 \text{ (always true)}}.$$
Thus if H0 is true, $F^\star \approx 1$, and large values of $F^\star$ indicate departures from H0. Under H0, $F^\star$ has an F distribution with degrees of freedom $p-1$ and $n-p$, and a test at the α level for H0 is given by rejecting H0 (i.e. "the overall regression is significant") if
$$F^\star > F^{p-1}_{n-p}(\alpha).$$
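This is exactly the "F-statistic" line printed by summary(lm(...)). As a check, its p-value can be reproduced from the F value reported for the Dwine full model (F = 99.1 on 2 and 18 degrees of freedom):

> pf(99.1, df1 = 2, df2 = 18, lower.tail = FALSE)   # about 1.9e-10, as in the output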

The Analysis of Variance table is given by

Source              d.f.    SS              MS = SS/d.f.    VR
Overall regression  p − 1   Y'HY − nȲ²      SSR/(p − 1)     {SSR/(p − 1)}/s²
Residual            n − p   Y'(I − H)Y      s²
Total               n − 1   Y'Y − nȲ²


3.5 Inferences about the parameters

In Theorem 3.4 we have seen that
$$\hat{\beta} \sim N_p(\beta, \sigma^2(X^TX)^{-1}).$$
Therefore
$$\hat{\beta}_j \sim N(\beta_j, \sigma^2 c_{jj}), \qquad j = 0, 1, 2, \ldots, p-1,$$
where $c_{jj}$ is the jth diagonal element of $(X^TX)^{-1}$ (counting from 0 to p − 1). Hence it is straightforward to make inferences about $\beta_j$ in the usual way.

A 100(1 − α)% confidence interval for $\beta_j$ is
$$\hat{\beta}_j \pm t_{n-p}\Big(\frac{\alpha}{2}\Big)\sqrt{S^2 c_{jj}},$$
and the test statistic for testing $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$ is
$$T = \frac{\hat{\beta}_j}{\sqrt{S^2 c_{jj}}} \sim t_{n-p} \quad \text{if } H_0 \text{ is true}.$$

Care is needed in interpreting the confidence intervals and tests. They refer only to the model we are currently fitting. Thus accepting $H_0: \beta_j = 0$ does not mean that $X_j$ has no explanatory power; it means that, conditional on $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_{p-1}$ being in the model, $X_j$ has no additional power. It is often best to think of the test as comparing the models without and with $X_j$, i.e.
$$H_0: E(y_i) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_{j-1}X_{j-1,i} + \beta_{j+1}X_{j+1,i} + \cdots + \beta_{p-1}X_{p-1,i}$$
versus
$$H_1: E(y_i) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_{p-1}X_{p-1,i}.$$
It does not tell us anything about the comparison between the models $E(y_i) = \beta_0$ and $E(y_i) = \beta_0 + \beta_j X_{ji}$.

3.6 Confidence interval for µ

We have
$$E(Y) = \mu = X\beta.$$
As with simple linear regression, we might want to estimate μ at a specific x, say $x_0 = (1, x_{10}, \ldots, x_{p-1,0})^T$, i.e.
$$\mu_0 = E(Y|X_1 = x_{10}, \ldots, X_{p-1} = x_{p-1,0}).$$
The point estimate will be
$$\hat{\mu}_0 = x_0^T\hat{\beta}.$$
Assuming normality, as usual, we can obtain a confidence interval for $\mu_0$.


Theorem 3.6.
$$\hat{\mu}_0 \sim N(\mu_0, \sigma^2 x_0^T(X^TX)^{-1}x_0).$$

Proof. (i) $\hat{\mu}_0 = x_0^T\hat{\beta}$ is a linear combination of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{p-1}$, each of which is normal. Hence $\hat{\mu}_0$ is also normal.

(ii)
$$E(\hat{\mu}_0) = E(x_0^T\hat{\beta}) = x_0^T E(\hat{\beta}) = x_0^T\beta = \mu_0.$$

(iii)
$$\operatorname{Var}(\hat{\mu}_0) = \operatorname{Var}(x_0^T\hat{\beta}) = x_0^T\operatorname{Var}(\hat{\beta})x_0 = \sigma^2 x_0^T(X^TX)^{-1}x_0.$$

The following corollary is an immediate consequence.

Corollary 3.2. A 100(1 − α)% confidence interval for $\mu_0$ is
$$\hat{\mu}_0 \pm t_{n-p}\Big(\frac{\alpha}{2}\Big)\sqrt{S^2 x_0^T(X^TX)^{-1}x_0}.$$

3.7 Predicting a new observation

To predict a new observation we need to take into account not only its expectation but also a possible new random error.

The point estimator of the new observation
$$Y_0 = Y|X_1 = x_{10}, \ldots, X_{p-1} = x_{p-1,0} = \mu_0 + \varepsilon_0$$
is
$$\hat{Y}_0 = x_0^T\hat{\beta}\;(= \hat{\mu}_0),$$
which, assuming normality, is such that
$$\hat{Y}_0 \sim N(\mu_0, \sigma^2 x_0^T(X^TX)^{-1}x_0).$$
Then
$$\hat{Y}_0 - \mu_0 \sim N(0, \sigma^2 x_0^T(X^TX)^{-1}x_0),$$
so
$$\hat{Y}_0 - (\mu_0 + \varepsilon_0) \sim N(0, \sigma^2 x_0^T(X^TX)^{-1}x_0 + \sigma^2),$$
that is,
$$\hat{Y}_0 - Y_0 \sim N(0, \sigma^2\{1 + x_0^T(X^TX)^{-1}x_0\}),$$
and hence
$$\frac{\hat{Y}_0 - Y_0}{\sqrt{\sigma^2\{1 + x_0^T(X^TX)^{-1}x_0\}}} \sim N(0, 1).$$
As usual we estimate $\sigma^2$ by $S^2$ and get
$$\frac{\hat{Y}_0 - Y_0}{\sqrt{S^2\{1 + x_0^T(X^TX)^{-1}x_0\}}} \sim t_{n-p}.$$
Hence a 100(1 − α)% prediction interval for $Y_0$ is given by
$$\hat{Y}_0 \pm t_{n-p}\Big(\frac{\alpha}{2}\Big)\sqrt{S^2\{1 + x_0^T(X^TX)^{-1}x_0\}}.$$

3.8 Example on sales data

In this section we consider an example on sales data. We define the variable y as the sales in £100,000, x1 as the promotional expenditure in £1,000, x2 as the number of active accounts, x3 as the district's potential as assessed by the sales director, and x4 as the number of competing companies in that district. We observe a small set of data and we start by considering a simple linear regression:

> sales <- read.csv("sales.csv")
> x1 <- sales[ ,1]
> x2 <- sales[ ,2]
> x3 <- sales[ ,3]
> x4 <- sales[ ,4]
> y  <- sales[ ,5]
> mody <- lm(y ~ x1)
> summary(mody)

Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max
-150.138  -39.452   -6.879   41.414  164.748

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  146.979     59.251   2.481   0.0276 *
x1             4.257     10.715   0.397   0.6976
---


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 82.5 on 13 degrees of freedom
Multiple R-squared:  0.012,	Adjusted R-squared:  -0.064
F-statistic: 0.1579 on 1 and 13 DF,  p-value: 0.6976

We see that x1 is not statistically significant (the p-value is around 0.6976) and the R² is only 0.012, i.e. 1.2%. Moreover, the residual plot suggests non-constant variance.

Figure 3.10: Standardized residuals versus fitted values for the regression model between y and x1 (left panel) and Q-Q plot of the standardized residuals (right panel).

> stdres1 <- rstandard(mody)
> fits1 <- fitted(mody)
> plot(fits1, stdres1, main = "Std res versus fits, sales, x1")
> qqnorm(stdres1)
> qqline(stdres1)

We also run a Shapiro-Wilk normality test on the standardized residuals:

> shapiro.test(stdres1)

Shapiro-Wilk normality test

data:  stdres1
W = 0.97842, p-value = 0.9575

From the output, the p-value is > 0.05, implying that the distribution of the standardized residuals is not significantly different from a normal distribution. In other words, we can assume normality.

Next we run a linear regression of y on x2:


> mody2 <- lm(y ~ x2)
> summary(mody2)

Call:
lm(formula = y ~ x2)

Residuals:
    Min      1Q  Median      3Q     Max
-76.041 -49.709  -7.084  32.275 100.918

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -43.317     60.671  -0.714  0.48787
x2             4.184      1.158   3.613  0.00315 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 58.63 on 13 degrees of freedom
Multiple R-squared:  0.501,	Adjusted R-squared:  0.4626
F-statistic: 13.05 on 1 and 13 DF,  p-value: 0.003153

We can see that x2 is significant, with p-value 0.003, and the R² is much better than for the first regression, moving from 1.2% to 50.1%.

> stdres2 <- rstandard(mody2)
> fits2 <- fitted(mody2)
> plot(fits2, stdres2, main = "Std res versus fits, sales, x2")
> qqnorm(stdres2)
> qqline(stdres2)

There is some suggestion of non-constant variance, as shown in the following figures.

> shapiro.test(stdres2)

Shapiro-Wilk normality test

data:  stdres2
W = 0.95023, p-value = 0.5282

Again the p-value is > 0.05, so the distribution of the standardized residuals is not significantly different from a normal distribution, and we can assume normality.

Next we move to the third regression, between y and x3:

> mody3 <- lm(y~x3)


Figure 3.11: Standardized residuals versus fitted values for the regression model between y and x2 (left panel) and Q-Q plot of the standardized residuals (right panel).

> summary(mody3)

Call:
lm(formula = y ~ x3)

Residuals:
     Min       1Q   Median       3Q      Max
-130.373  -42.521   -3.028   34.472  136.272

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  119.419     47.543   2.512    0.026 *
x3             5.232      4.536   1.153    0.269
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 79.05 on 13 degrees of freedom
Multiple R-squared:  0.09284,	Adjusted R-squared:  0.02306
F-statistic: 1.33 on 1 and 13 DF,  p-value: 0.2695

From this regression we can see that x3 is not statistically significant, with p-value 0.269, and the R² of 9.28% is lower than for the previous regression. On the other hand the residual plots look fine.

> stdres3 <- rstandard(mody3)
> fits3 <- fitted(mody3)
> plot(fits3, stdres3, main = "Std res versus fits, sales, x3")
> qqnorm(stdres3)
> qqline(stdres3)

We also run a Shapiro-Wilk normality test on the standardized residuals:


Figure 3.12: Standardized residuals versus fitted values for the regression model between y and x3 (left panel) and Q-Q plot of the standardized residuals (right panel).

> shapiro.test(stdres3)

Shapiro-Wilk normality test

data:  stdres3
W = 0.97637, p-value = 0.9386

As before, the p-value is larger than 0.05. Next we move to a regression of y on x4:

> mody4 <- lm(y ~ x4)
> summary(mody4)

Call:
lm(formula = y ~ x4)

Residuals:
    Min      1Q  Median      3Q     Max
-66.266 -37.017  -8.269  30.083  82.235

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  396.073     49.248   8.042 2.11e-06 ***
x4           -25.051      5.242  -4.779  0.00036 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 49.99 on 13 degrees of freedom
Multiple R-squared:  0.6373,	Adjusted R-squared:  0.6094
F-statistic: 22.84 on 1 and 13 DF,  p-value: 0.0003602


We can see that x4 is the most significant variable, with p-value 0.00036, and its R² of 63.7% is the highest among the simple linear regression models. The residual plot suggests that a non-linear term in x4 might be needed:

> stdres4 <- rstandard(mody4)
> fits4 <- fitted(mody4)
> plot(fits4, stdres4, main = "Std res versus fits, sales, x4")
> qqnorm(stdres4)
> qqline(stdres4)

Figure 3.13: Standardized residuals versus fitted values for the regression model between y and x4 (left panel) and Q-Q plot of the standardized residuals (right panel).

> shapiro.test(stdres4)

Shapiro-Wilk normality test

data:  stdres4
W = 0.94281, p-value = 0.419

Next, we consider the full model, with all four regressors included in the analysis:

> modyf <- lm(y ~ x1 + x2 + x3 + x4)
> summary(modyf)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6881 -3.1604  0.4714  2.0541  6.0053


Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.2286     8.7874  20.169 1.98e-09 ***
x1            2.1702     0.6737   3.221  0.00915 **
x2            3.5380     0.1092  32.414 1.84e-11 ***
x3            0.2035     0.3189   0.638  0.53760
x4          -22.1583     0.5454 -40.630 1.95e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.119 on 10 degrees of freedom
Multiple R-squared:  0.9971,	Adjusted R-squared:  0.9959
F-statistic: 851.7 on 4 and 10 DF,  p-value: 1.285e-12

Fitting the full model with all variables, the overall regression is highly significant, with p-value 1.285e-12, while x3 is not significant, with p-value 0.537. The residual plots are fine, as shown below:

> stdresf <- rstandard(modyf)
> fitsf <- fitted(modyf)
> plot(fitsf, stdresf, main = "Std res versus fits, sales, full")
> qqnorm(stdresf)
> qqline(stdresf)

Figure 3.14: Standardized residuals versus fitted values for the full regression model (left panel) and Q-Q plot of the standardized residuals (right panel).

Looking at the Shapiro-Wilk test, the p-value is around 0.62:

> shapiro.test(stdresf)

        Shapiro-Wilk normality test

data:  stdresf
W = 0.95604, p-value = 0.6239


We decide to run a multiple regression model without the non-significant variable x3, which gives the following regression model:

> modyr <- lm(y ~ x1 + x2 + x4)
> summary(modyr)

Call:
lm(formula = y ~ x1 + x2 + x4)

Residuals:
    Min      1Q  Median      3Q     Max
-7.5414 -3.3514  0.6443  2.3247  6.7618

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 178.52062    8.31759   21.46 2.50e-10 ***
x1            2.10555    0.64785    3.25  0.00773 **
x2            3.56240    0.09945   35.82 9.66e-13 ***
x4          -22.18799    0.52855  -41.98 1.71e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.98 on 11 degrees of freedom
Multiple R-squared:  0.997,	Adjusted R-squared:  0.9961
F-statistic:  1200 on 3 and 11 DF,  p-value: 4.078e-14

All the terms are highly significant and the R² is close to 100%. The residual plots are fine and this model seems highly satisfactory.

> stdresr <- rstandard(modyr)
> fitsr <- fitted(modyr)
> plot(fitsr, stdresr, main = "Std res versus fits, sales, reduced")
> qqnorm(stdresr)
> qqline(stdresr)

We then carry out the Shapiro-Wilk test, and the result is in line with the previous results:

> shapiro.test(stdresr)

        Shapiro-Wilk normality test

data:  stdresr
W = 0.94754, p-value = 0.4866

Now suppose we had decided to ignore x1 as a possible regressor when we saw that the simple linear regression on x1 was not significant; then we have the following regression:


Figure 3.15: Standardized residuals versus fitted values for the reduced regression model (left panel) and Q-Q plot of the standardized residuals (right panel).

> modyr1 <- lm(y ~ x2 + x4)
> summary(modyr1)

Call:
lm(formula = y ~ x2 + x4)

Residuals:
    Min      1Q  Median      3Q     Max
-12.263  -1.397   0.921   1.665  13.573

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 189.8256    10.1278   18.74 2.97e-10 ***
x2            3.5692     0.1333   26.78 4.52e-12 ***
x4          -22.2744     0.7076  -31.48 6.66e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.675 on 12 degrees of freedom
Multiple R-squared:  0.994,	Adjusted R-squared:  0.993
F-statistic: 998.9 on 2 and 12 DF,  p-value: 4.531e-14

The terms are significant, but the residual plot is not satisfactory, suggesting a non-constant variance, as shown below:

> stdresr1 <- rstandard(modyr1)
> fitsr1 <- fitted(modyr1)
> plot(fitsr1, stdresr1, main = "Std res versus fits, sales, reduced")
> qqnorm(stdresr1)
> qqline(stdresr1)


Figure 3.16: Standardized residuals versus fitted values for the “wrong” reduced regression model (left panel) and Q-Q plot of the standardized residuals (right panel).

Looking at the Q-Q plot and at the Shapiro-Wilk test, the p-value is smaller than in the previous results and the Q-Q plot looks odd.

> shapiro.test(stdresr1)

Shapiro-Wilk normality test

data:  stdresr1
W = 0.91424, p-value = 0.1572

3.9 Model Building

We have already mentioned the principle of parsimony: we should use the simplest model that achieves our purpose.

It is easy to get a simple model (Yi = β0 + εi) and it is easy to find a model that describes the data perfectly (Yi = yi + εi), but the first is generally too simple and the second is not a useful model. Achieving both, namely a simple model that describes the data well, is something of an art. Moreover, there is often more than one model which does a reasonable job.

The Sales data set discussed in the previous lecture illustrates some of the features that can occur. On its own, variable X1 explains only about 1% of the variation, but once X2 and X4 are included in the model, X1 is significant and also seems to cure problems with normality and non-constant variance.

3.9.1 F-test for the deletion of a subset of variables

Suppose the overall regression model, as tested by the Analysis of Variance table, is significant. We know that not all of the β parameters are zero, but we may still be able to delete several variables. We can carry out the subset test, based on the extra sum of squares principle. We are asking whether we can reduce the set of regressors
$$x_1, x_2, \ldots, x_{p-1}$$
to, say,
$$x_1, x_2, \ldots, x_{q-1}$$
(renumbering if necessary), where q < p, by omitting $x_q, x_{q+1}, \ldots, x_{p-1}$. Thus we are interested in whether the inclusion of $x_q, x_{q+1}, \ldots, x_{p-1}$ in the model provides a significant increase in the overall regression sum of squares, or equivalently a significant decrease in the residual sum of squares. The difference between the sums of squares is called the extra sum of squares due to $x_q, \ldots, x_{p-1}$ given $x_1, \ldots, x_{q-1}$ already in the model, and is defined by
$$SS(x_q, \ldots, x_{p-1}|x_1, \ldots, x_{q-1}) = \underbrace{SS(x_1, \ldots, x_{p-1})}_{\text{regression SS, full model}} - \underbrace{SS(x_1, \ldots, x_{q-1})}_{\text{regression SS, reduced model}} = \underbrace{SS_E^{(red)}}_{\text{residual SS, reduced model}} - \underbrace{SS_E^{(full)}}_{\text{residual SS, full model}}.$$

For calculation let
$$\beta_1^T = [\beta_0, \beta_1, \ldots, \beta_{q-1}], \qquad \beta_2^T = [\beta_q, \beta_{q+1}, \ldots, \beta_{p-1}],$$
so that
$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}.$$
Similarly divide X into two submatrices $X_1$ and $X_2$, so that $X = [X_1, X_2]$ where
$$X_1 = \begin{bmatrix} 1 & x_{11} & \cdots & x_{q-1,1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{q-1,n} \end{bmatrix}, \qquad X_2 = \begin{bmatrix} x_{q1} & \cdots & x_{p-1,1} \\ \vdots & & \vdots \\ x_{qn} & \cdots & x_{p-1,n} \end{bmatrix}.$$
The full model
$$Y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon$$
has
$$SS_R^{(full)} = \hat{\beta}^TX^TY - n\bar{Y}^2, \qquad SS_E^{(full)} = Y^TY - \hat{\beta}^TX^TY.$$
Similarly the reduced model
$$Y = X_1\beta_1 + \varepsilon$$
has
$$SS_R^{(red)} = \hat{\beta}_1^TX_1^TY - n\bar{Y}^2, \qquad SS_E^{(red)} = Y^TY - \hat{\beta}_1^TX_1^TY.$$


Hence the extra sum of squares is
$$SS_{extra} = \hat\beta^T X^T Y - \hat\beta_1^T X_1^T Y.$$

To determine whether the change in sum of squares is significant, we must test the hypothesis
$$H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p-1} = 0$$
versus
$$H_1: \text{at least one of these is non-zero.}$$

It can be shown that, if $H_0$ is true,
$$F^* = \frac{SS_{extra}/(p-q)}{s^2} \sim F^{p-q}_{n-p}.$$
So we reject $H_0$ at the α level if
$$F^* > F^{p-q}_{n-p}(\alpha)$$
and conclude there is sufficient evidence to show some (but not necessarily all) of the ‘extra’ variables $x_q, \ldots, x_{p-1}$ should be included in the model.

The ANOVA table is given by

Source                                 d.f.    SS                              MS = SS/d.f.         F*
x_1, ..., x_{q-1}                      q - 1   SS(x_1, ..., x_{q-1})
x_q, ..., x_{p-1} | x_1, ..., x_{q-1}  p - q   Extra SS                        Extra SS/(p - q)     Extra SS/((p - q)s^2)
Overall regression                     p - 1   SS_R = SS(x_1, ..., x_{p-1})
Residual                               n - p   SS_E                            s^2
Total                                  n - 1   SS_T

In the ANOVA table we use the notation $x_q, \ldots, x_{p-1} \mid x_1, \ldots, x_{q-1}$ to denote that this is the effect of the variables $x_q, \ldots, x_{p-1}$ given that the variables $x_1, \ldots, x_{q-1}$ are already included in the model.

Note that the F-test for the inclusion of a single variable $x_{p-1}$ (this is the case q = p − 1) can also be performed by an equivalent t-test where
$$t = \frac{\hat\beta_{p-1}}{se(\hat\beta_{p-1})},$$
where $se(\hat\beta_{p-1})$ is the estimated standard error of $\hat\beta_{p-1}$. We compare the value t with $t_{n-p}$ for a two-sided test of $H_0: \beta_{p-1} = 0$. In fact, as usual, $F^* = t^2$.
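In R the subset test can be carried out by fitting the reduced and full models and passing both to anova(). The sketch below is not from the notes; it assumes a hypothetical data frame dat with response y and candidate regressors x1, ..., x4, and tests whether x3 and x4 can be dropped.

# Minimal sketch of the extra sum of squares (subset) F-test, assuming `dat` exists.
full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
reduced <- lm(y ~ x1 + x2, data = dat)
# anova(reduced, full) computes the extra SS and F* = (SS_extra/(p - q)) / s^2,
# with s^2 taken from the full model.
anova(reduced, full)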

Note that we can repeatedly test individual parameters and we get the following sums of squares and degrees of freedom.


Source of variation            df      SS
Full model                     p - 1   SS_R
X_1                            1       SS(β_1)
X_2 | X_1                      1       SS(β_2 | β_1)
X_3 | X_1, X_2                 1       SS(β_3 | β_1, β_2)
...                            ...     ...
X_{p-1} | X_1, ..., X_{p-2}    1       SS(β_{p-1} | β_1, ..., β_{p-2})
Residual                       n - p   SS_E
Total                          n - 1   SS_T

Note that this only makes sense if there is some ordering of the terms (otherwise, the order in which we test is arbitrary).

Example

In an investigation to find factors affecting the productivity of plankton in the River Thames, measurements were taken at 17 monthly intervals on the productivity of oxygen in logarithmic units (Y), the amount of chlorophyll (X1) and the amount of light (X2).

The full model was fitted and gave the regression equation

yi = −1.34 + 0.0118x1i + 0.0091x2i

The ANOVA table is

Source                      SS      df   Mean Square   F
Regression on X1 and X2     24.43   2    12.22         13.79
Residual                    12.41   14   0.886
Total                       36.85   16

This gives a significant regression. If we analyse the simple linear regression on X1 alone, we have the following regression equation

yi = 0.59 + 0.0224x1i

Thus the ANOVA table is

Source                     SS      df   Mean Square   F
Regression on X1 alone     10.92   1    10.92         6.32
Residual                   25.92   15   1.729
Total                      36.85   16

On the other hand, if we consider a simple linear regression with only X2, we have the following regression equation

yi = −1.18 + 0.0106x2i

with ANOVA table as follows

Source                     SS      df   Mean Square   F
Regression on X2 alone     21.87   1    21.87         21.89
Residual                   14.98   15   0.999
Total                      36.85   16

The extra SS due to X2 given X1 is 24.43 − 10.92 = 13.51, and this is where the extra significance of X2 when X1 is in the model comes from. Moreover, we can compute the F statistic as
$$F = \frac{13.51}{0.886} = 15.25,$$
which is highly significant against $F^1_{14}$. If we look at the extra sum of squares when X2 is in the model, we have the extra SS equal to 24.43 − 21.87 = 2.56, thus the F statistic is
$$F = \frac{2.56}{0.886} = 2.89,$$
which is not significant at the 10% significance level. This suggests X2 alone gives a satisfactory model.
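As a quick numerical check (not in the original notes), the relevant critical values and p-values can be obtained in R; the degrees of freedom below follow the subset test with n = 17 and p = 3.

# 10% and 5% critical values of F with 1 and 14 degrees of freedom
qf(c(0.90, 0.95), df1 = 1, df2 = 14)
# p-values for the two observed statistics quoted above
pf(c(15.25, 2.89), df1 = 1, df2 = 14, lower.tail = FALSE)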

Example

Recall the sales data example; we have the full model with all the variables included in the regression

> modyf <- lm(y~ x1 + x2 + x3 + x4)

We run the ANOVA table as stated above and we have

> anova(modyf)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq  F value    Pr(>F)    
x1         1   1074    1074   40.995 7.805e-05 ***
x2         1  44505   44505 1698.185 1.696e-12 ***
x3         1    444     444   16.928  0.002096 ** 
x4         1  43262   43262 1650.770 1.953e-12 ***
Residuals 10    262      26
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The line x3 means X3 given that X1 and X2 are in the model, and the p-value suggests X3 is a useful regressor. F uses the s² from the full regression; thus if we run the summary of the model, we have


> summary(modyf)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6881 -3.1604  0.4714  2.0541  6.0053 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  177.2286     8.7874  20.169 1.98e-09 ***
x1             2.1702     0.6737   3.221  0.00915 ** 
x2             3.5380     0.1092  32.414 1.84e-11 ***
x3             0.2035     0.3189   0.638  0.53760    
x4           -22.1583     0.5454 -40.630 1.95e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.119 on 10 degrees of freedom
Multiple R-squared: 0.9971,    Adjusted R-squared: 0.9959
F-statistic: 851.7 on 4 and 10 DF,  p-value: 1.285e-12

Here the line x3 means X3 given all the other regressors are in the model. Obviously we can modify the order of the variables in the regression and we have the following command

> modyf1 <- lm(y ~ x2 + x4 + x1 + x3)
> summary(modyf1)

Call:
lm(formula = y ~ x2 + x4 + x1 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6881 -3.1604  0.4714  2.0541  6.0053 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  177.2286     8.7874  20.169 1.98e-09 ***
x2             3.5380     0.1092  32.414 1.84e-11 ***
x4           -22.1583     0.5454 -40.630 1.95e-12 ***
x1             2.1702     0.6737   3.221  0.00915 ** 
x3             0.2035     0.3189   0.638  0.53760    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 5.119 on 10 degrees of freedom
Multiple R-squared: 0.9971,    Adjusted R-squared: 0.9959
F-statistic: 851.7 on 4 and 10 DF,  p-value: 1.285e-12

While the ANOVA table says

> anova(modyf1)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq   F value    Pr(>F)    
x2         1  44864   44864 1711.9076 1.630e-12 ***
x4         1  44148   44148 1684.5702 1.766e-12 ***
x1         1    262     262    9.9938   0.01014 *  
x3         1     11      11    0.4075   0.53760    
Residuals 10    262      26
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this case, with x3 last in the list, the p-value agrees with that shown in the summary because it is measuring the worth of X3 given that X1, X2 and X4 are in the model.
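The same conclusion can be reached without reordering by comparing the reduced and full fits directly; the sketch below is not part of the original output and reuses the model object modyf and the variables defined in this example.

# Subset F-test for dropping x3 from the full sales model.
mod.red <- lm(y ~ x1 + x2 + x4)
anova(mod.red, modyf)   # F should equal the square of the t value for x3 (0.638^2)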

3.9.2 All subsets regression

If there is no natural ordering to the explanatory variables, then it is desirable to examine all possible subsets. For example, if we have three candidate explanatory variables X1, X2 and X3, the possible models are

Yi ∼ N(µi, σ2)

with

µi = β0

µi = β0 + β1x1i

µi = β0 + β2x2i

µi = β0 + β3x3i

µi = β0 + β1x1i + β2x2i

µi = β0 + β1x1i + β3x3i

µi = β0 + β2x2i + β3x3i

µi = β0 + β1x1i + β2x2i + β3x3i

There are $8 = 2^3$ models. In general with p − 1 explanatory variables there are $2^{p-1}$ possible models, so even with p = 5 or 6 it is difficult to do a full comparison of all models.

Instead we usually compare models by calculating a few statistics for each model. Three statistics that are useful are $s^2 = MS_E$, $R^2(adj)$ and $C_k$.


Residual mean square $MS_E$

If the full model with all candidate explanatory variables is correct then $E(MS_E^{(full)}) = \sigma^2$. If we have excluded one or more important variables then $E(MS_E^{(red)}) > \sigma^2$. Hence we may be able to identify the most appropriate model as being

(i) the one with the smallest number of explanatory variables (parameters) for which $MS_E$ is close to $MS_E^{(full)}$;

(ii) the one with smallest $MS_E$.

Condition (i) aims for the simplest acceptable model. Condition (ii) is more conservative, and should be considered carefully as it may just suggest the full model.

A sketch of the smallest $MS_E$ for a given p − 1 against p − 1 can be useful.

Coefficient of determination $R^2$

The coefficient of determination $R^2$, usually expressed as a percentage, is defined by
$$R^2 = \frac{SS_R}{SS_T} \times 100 = \left(1 - \frac{SS_E}{SS_T}\right) \times 100.$$

Adding terms to a model always increases $R^2$. However, the model with k parameters, for k as small as possible, having $R^2_k$ close to $R^2_{full}$ from the full model might be regarded as being best. Judgement is required and a plot of $R^2_k$ against k − 1 or k can be useful to identify where the plot levels off.

Adjusted $R^2$

Adjusted $R^2$ is defined by
$$R^2(adj) = \left(1 - (n-1)\frac{MS_E}{SS_T}\right) \times 100.$$

It takes into account the number of predictors in the model and can be useful for comparing models with different numbers of parameters. Whereas $R^2$ always increases when an additional regressor is included, $R^2(adj)$ only increases if the corresponding F statistic for inclusion of that variable is greater than one. While this is an improvement on $R^2$, we would normally only include the variable if its F value was rather larger than one.

Mallow's statistic $C_k$

For a model with k parameters we define
$$C_k = \frac{SS_E^{(k)}}{\sigma^2} + 2k - n.$$


If this model includes all important explanatory variables, then $E(SS_E^{(k)}) = (n-k)\sigma^2$ and so
$$E(C_k) = \frac{(n-k)\sigma^2}{\sigma^2} + 2k - n = k.$$
If the model excludes important variables, then $E(SS_E^{(k)}) > (n-k)\sigma^2$ and so $E(C_k) > k$. Hence we should choose a model with $C_k$ close to k.

It can also be shown that $C_k$ is an estimator of the mean square error of prediction, i.e.
$$\frac{1}{n}\sum_{i=1}^{n}\left[\mathrm{Var}(\hat{y}_i) + \{\mathrm{bias}(\hat{y}_i)\}^2\right].$$

This suggests minimizing Ck. Thus, we should choose either

(i) the model which minimizes Ck; or

(ii) a model with Ck close to k, with k small.

Again a plot of Ck versus k is useful.

Note that $C_k$ depends on the unknown $\sigma^2$, which we usually estimate by $MS_E^{(full)}$ to get
$$\hat{C}_k = \frac{SS_E^{(k)}}{MS_E^{(full)}} + 2k - n.$$
R denotes $\hat{C}_k$ simply by $C_p$, and using this estimate in the same way as $C_k$ is often reasonable.

Note that this means that for the full model with p parameters $C_p$ is always equal to p and so can't be used to justify use of the full model.
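As an illustration (not from the notes), $\hat{C}_k$ can be computed directly from two lm() fits; the sketch assumes the sales variables x1–x4 and y are in the workspace as in the earlier example.

# Mallow's C_k by hand for the three-regressor sales model (k = 4 parameters).
full <- lm(y ~ x1 + x2 + x3 + x4)
red  <- lm(y ~ x1 + x2 + x4)
n  <- length(y)
s2 <- deviance(full) / df.residual(full)   # MS_E from the full model
deviance(red) / s2 + 2 * 4 - n             # C_k for the reduced model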

Example

We consider again the Sales data example. Recall that R gives the values of the different $R^2$ and adjusted $R^2$ statistics. Below we report a table of these values:

Model                                    adj R²   C_k      R²
1 regressor  (k = 2) – X4                60.9     1228.5   63.7
2 regressors (k = 3) – X2, X4            99.3     11.4     99.4
3 regressors (k = 4) – X1, X2, X4        99.6     3.4      99.7
4 regressors (k = 5) – X1, X2, X3, X4    99.6     5.0      99.7

The model with 3 regressors is the best as it

a) has the highest adjusted R²;

b) has the smallest Mallow's statistic, and 3.4 is close to k = 4 (see the sketch below).
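A table like this can be produced automatically; the following sketch (not from the notes) uses the leaps add-on package and assumes the sales data file has its five columns in the order used earlier.

# All-subsets comparison with the leaps package.
library(leaps)
sales <- read.csv("sales.csv")
names(sales) <- c("x1", "x2", "x3", "x4", "y")   # the notes index the columns by position
subs <- regsubsets(y ~ x1 + x2 + x3 + x4, data = sales, nbest = 1)
ss <- summary(subs)
# One row per subset size: which variables are in, adjusted R^2, Mallow's C_p and R^2.
data.frame(ss$which,
           adjR2 = round(100 * ss$adjr2, 1),
           Cp    = round(ss$cp, 1),
           R2    = round(100 * ss$rsq, 1))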


3.9.3 Automatic methods for selecting a regression model

A number of automatic methods have been proposed to determine which variables to include in the model. None are perfect and different methods give different results. Most involve a sequence of statistical tests and the overall effect of these can be unclear.

The first suggestion is to fit all possible regressions and use the statistics discussed above to check the adequacy of the model. This is not feasible if we have more than four or five explanatory variables.

A fairly straightforward procedure is the backward elimination procedure. This proceeds as follows.

1. Fit the multiple regression model with all explanatory variables. (Note that, if some variables are close to being linear combinations of others, a high level of computational accuracy is required to ensure that $X^TX$, a near singular matrix, can be inverted to give sensible results. We will return to this problem of multicollinearity later.)

2. Calculate the F (or t) statistic for the exclusion of each variable.

3. Compare the lowest value of the statistic with the predetermined significance level (say α = 0.05) and omit the corresponding variable if it is not significant. (Note that the lowest value of the F statistic will correspond with the variable with lowest value of |estimate|/standard error.)

4. We now have a new model with one fewer variable than we started with. We return to step 2 and calculate the statistic for the exclusion of each variable still in the model.

5. Continue until no further variable can be omitted.

A procedure which works in the opposite way to backwards elimination is called Stepwise regression or Modified ‘forward regression’.

1. Start with the null model.

2. Introduce the explanatory variable with the highest F or t-value for inclusion.

3. After each inclusion, test the other variables in the current model to see whether they can be omitted. Use say α = 0.10 for omission and α = 0.05 for inclusion.

4. Continue until no more can be included or omitted.

A difficulty is that, at each stage, the estimate of σ² used may be upwardly biased and hence not reliable. This can be overcome by using s² based on all the variables, if this is computationally feasible.

Another possibility is to leave out step 3 i.e. never omit a variable once it has been included.

The following example illustrates how the automatic methods can be used for a small example. For larger numbers of variables it would be necessary to use R to carry out the calculations.


A hospital surgical unit was interested in predicting survival of patients undergoing a particular type of kidney operation. A random selection of 54 patients was available for analysis. The recorded independent variables were X1 blood clotting score, X2 prognostic index, X3 enzyme function test score, X4 kidney function test score. The dependent variable Y is the log of the survival time.

A researcher found the following table of residual sums of squares (RSS)

Variables   RSS      Variables            RSS
NULL        3.973    X2 + X3              0.7431
X1          3.496    X2 + X4              1.392
X2          2.576    X3 + X4              1.245
X3          2.215    X1 + X2 + X3         0.1099
X4          1.878    X1 + X2 + X4         1.390
X1 + X2     2.232    X1 + X3 + X4         1.116
X1 + X3     1.407    X2 + X3 + X4         0.465
X1 + X4     1.876    X1 + X2 + X3 + X4    0.1098

Fit a multiple regression model by

1. backward elimination

2. stepwise regression

3. forward fitting allowing variables to be included only, never excluded.

In each case use the estimate of σ2 from the full model. Comment on your results.

Solution. There are 54 observations and the full model has 4 explanatory variables so

$$s^2 = \frac{0.1098}{49} = 0.00224.$$

1. For backward elimination we test if we can exclude the variable whose exclusion increases the residual sum of squares least. Thus the first candidate for exclusion is X4. The F value is (0.1099 − 0.1098)/s² = 0.044 and, as this is less than 1, we can certainly exclude X4. Once X4 has been excluded then excluding the variable X1 produces the smallest increase in residual sum of squares. The F value is (0.7431 − 0.1099)/s² = 282.6, so there is clear evidence against the hypothesis that X1 should be excluded and hence our final model is X1 + X2 + X3.

2. For stepwise regression we start with the null model and test if we should include the variable whose inclusion reduces the residual sum of squares most.

   The first candidate for inclusion is X4. The F value is (3.973 − 1.878)/s² = 934.9, so we must include X4. Next X3 produces the biggest reduction in residual sum of squares from the model with just X4; the F value is (1.878 − 1.245)/s² = 282.5, so we must also include X3. Then X2 produces the biggest reduction in residual sum of squares from X3 + X4; the F value is (1.245 − 0.465)/s² = 348.1, so we have to include X2.

   We must now consider whether, with X2 included, X3 or X4 could be omitted. The F values are (1.392 − 0.465)/s² = 413.7 and (0.7431 − 0.465)/s² = 124.1, so we cannot exclude either. Now can we include X1? The F value is (0.465 − 0.1098)/s² = 158.5, so we must include X1. Now we can exclude X4 as for backward elimination but no other variables, so the final model is X1 + X2 + X3.

3. If we use forward fitting but never exclude variables then our final model will be X1 + X2 + X3 + X4.

   Although X4 is the single most informative variable, it is unnecessary when the others are included in the model. This illustrates why forward fitting is not to be recommended.

Akaike’s Information Criterion

The problems with using a series of significance tests for which the overall Type I error is unclear can be addressed by using Akaike's Information Criterion (AIC) to find the best model.

We define AIC by
$$AIC = 2(p+1) - 2\log L$$
where L is the likelihood evaluated at the maximum likelihood estimates of the unknown parameters. Note that for a multiple regression model we have p regression parameters $\beta_0, \beta_1, \ldots, \beta_{p-1}$ and the variance $\sigma^2$.

As we discussed in the section on maximum likelihood in Chapter 2, the maximum likelihood estimates of the regression parameters are just the least squares estimates. The mle of $\sigma^2$, however, is not $S^2$ but
$$\hat\sigma^2 = \frac{SS_E}{n}.$$
It can be shown that
$$-2\log L = n(\log 2\pi + \log\hat\sigma^2 + 1).$$
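A quick numerical check of this formula (not in the original notes): for an assumed lm fit, here the full sales model with x1–x4 and y already in the workspace, the hand calculation should agree with R's built-in AIC().

# Checking AIC = 2(p + 1) - 2 log L against AIC() for an assumed fit.
fit <- lm(y ~ x1 + x2 + x3 + x4)
n <- length(residuals(fit))
p <- length(coef(fit))                 # regression parameters beta_0, ..., beta_{p-1}
sig2hat <- sum(residuals(fit)^2) / n   # mle of sigma^2 = SS_E / n
minus2logL <- n * (log(2 * pi) + log(sig2hat) + 1)
c(by_hand = 2 * (p + 1) + minus2logL, builtin = AIC(fit))   # should agree

Note that step(), used below, reports the value from extractAIC(), which drops additive constants that are the same for every model fitted to the same data; its AIC values therefore differ from AIC() by a fixed shift, but model comparisons are unaffected.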

We want to minimise the AIC. There are different ways to carry out the procedure, in the same way that you can do backwards elimination or forwards fitting.

So we could start with the full model. We compare the AIC for the full model with the AIC for each model omitting one regressor variable. We find the model with minimum AIC; if this is the full model, we stop. Otherwise we start from the current model and repeat. We stop once the chosen model isn't changed. R automates this process using the command > reduced.model <- step(mody, direction="backward").

Another possibility is to start with the null model (in R the null model is designated by y ~ 1), then we try adding variables and add the variable which gives the smallest AIC. In this case we have to tell R which variables could be added, using the command aic.forward.model <- step(modyn, scope=~x1+x2+x3+x4, direction="forward"), where we suppose there are 4 possible variables which might be added. Again we continue until we don't change the model.

You can also try adding and dropping variables from the current model by using direction="both".


The following example shows how this works with the sales data.

> sales <- read.csv("sales.csv")
> x1 <- sales[ ,1]
> x2 <- sales[ ,2]
> x3 <- sales[ ,3]
> x4 <- sales[ ,4]
> y  <- sales[ ,5]

> mody <- lm(y ~ x1 + x2 + x3 + x4)

> reduced.model <- step(mody, direction="backward")
Start:  AIC=52.91
y ~ x1 + x2 + x3 + x4

       Df Sum of Sq   RSS     AIC
- x3    1        11   273  51.508
<none>               262  52.909
- x1    1       272   534  61.586
- x2    1     27535 27797 120.869
- x4    1     43262 43524 127.595

Step:  AIC=51.51
y ~ x1 + x2 + x4

       Df Sum of Sq   RSS     AIC
<none>               273  51.508
- x1    1       262   535  59.604
- x2    1     31813 32086 121.022
- x4    1     43695 43968 125.747

> modyn <- lm(y ~ 1)
> aic.forward.model <- step(modyn, scope=~x1+x2+x3+x4, direction="forward")
Start:  AIC=132.42
y ~ 1

       Df Sum of Sq   RSS    AIC
+ x4    1     57064 32483 119.21
+ x2    1     44864 44683 123.99
<none>              89547 132.42
+ x3    1      8314 81233 132.96
+ x1    1      1074 88473 134.24

Step:  AIC=119.21
y ~ x4

       Df Sum of Sq   RSS     AIC
+ x2    1     31948   535  59.604
<none>              32483 119.206
+ x3    1      3873 28610 119.302
+ x1    1       397 32086 121.022

Step:  AIC=59.6
y ~ x4 + x2

       Df Sum of Sq    RSS    AIC
+ x1    1   261.911 272.75 51.508
<none>              534.66 59.604
+ x3    1     0.636 534.03 61.586

Step:  AIC=51.51
y ~ x4 + x2 + x1

       Df Sum of Sq    RSS    AIC
<none>              272.75 51.508
+ x3    1    10.679 262.07 52.909

> aic.forward.model <- step(modyn, scope=~x1+x2+x3+x4, direction="both")
Start:  AIC=132.42
y ~ 1

       Df Sum of Sq   RSS    AIC
+ x4    1     57064 32483 119.21
+ x2    1     44864 44683 123.99
<none>              89547 132.42
+ x3    1      8314 81233 132.96
+ x1    1      1074 88473 134.24

Step:  AIC=119.21
y ~ x4

       Df Sum of Sq   RSS     AIC
+ x2    1     31948   535  59.604
<none>              32483 119.206
+ x3    1      3873 28610 119.302
+ x1    1       397 32086 121.022
- x4    1     57064 89547 132.417

Step:  AIC=59.6
y ~ x4 + x2

       Df Sum of Sq   RSS     AIC
+ x1    1       262   273  51.508
<none>                535  59.604
+ x3    1         1   534  61.586
- x2    1     31948 32483 119.206
- x4    1     44148 44683 123.989

Step:  AIC=51.51
y ~ x4 + x2 + x1

       Df Sum of Sq   RSS     AIC
<none>                273  51.508
+ x3    1        11   262  52.909
- x1    1       262   535  59.604
- x2    1     31813 32086 121.022
- x4    1     43695 43968 125.747

We can see that all the methods choose the model with variables X1, X2 and X4. With lots of variables it is possible that the different methods choose different models.

3.10 Problems with fitting regression models

3.10.1 Near-singular $X^TX$

We have seen that if $X^TX$ is singular, no unique least squares estimators exist. The singularity is caused by linear dependence among the explanatory variables.

For example suppose we have a model with no intercept and 2 regressors. If the values of the two regressors were equal then
$$X = \begin{bmatrix} -1 & -1 \\ -1 & -1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix} \quad\Rightarrow\quad X^TX = \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix}, \qquad \det(X^TX) = 0$$
and so $X^TX$ is singular and no unique least squares estimator exists.

If there are “near” linear dependencies among the explanatory variables, the $X^TX$ matrix can be “nearly” singular. Here the two variables do not have equal values but they are quite close, so
$$X = \begin{bmatrix} -1 & -0.9 \\ -1 & -1.1 \\ 1 & 0.9 \\ 1 & 1.1 \end{bmatrix} \quad\Rightarrow\quad X^TX = \begin{bmatrix} 4 & 4 \\ 4 & 4.04 \end{bmatrix}, \qquad \det(X^TX) = 0.16$$


and so $X^TX$ is nonsingular but $\det(X^TX)$ is close to zero. We find
$$(X^TX)^{-1} = \frac{1}{0.16}\begin{bmatrix} 4.04 & -4 \\ -4 & 4 \end{bmatrix} = \begin{bmatrix} 25.25 & -25 \\ -25 & 25 \end{bmatrix}.$$
Now recall that $\mathrm{Var}(\hat\beta_j) = \sigma^2 c_{jj}$, so in this case $\mathrm{Var}(\hat\beta_1) = 25.25\sigma^2$ and $\mathrm{Var}(\hat\beta_2) = 25\sigma^2$, which are both large. Also $\mathrm{cov}(\hat\beta_1, \hat\beta_2) = -25\sigma^2$.

By contrast, if
$$X = \begin{bmatrix} -1 & -1 \\ -1 & 1 \\ 1 & -1 \\ 1 & 1 \end{bmatrix} \quad\Rightarrow\quad X^TX = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}, \qquad \det(X^TX) = 16,$$
$$(X^TX)^{-1} = \begin{bmatrix} 0.25 & 0 \\ 0 & 0.25 \end{bmatrix},$$
so $\mathrm{Var}(\hat\beta_j) = 0.25\sigma^2$ for j = 1, 2 and $\mathrm{cov}(\hat\beta_1, \hat\beta_2) = 0$.
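These two design matrices are easy to check numerically; the following sketch (not in the notes) simply reproduces the determinants and inverses quoted above.

# Near-singular versus orthogonal design, as in the two examples above.
X_near <- cbind(c(-1, -1, 1, 1), c(-0.9, -1.1, 0.9, 1.1))
X_orth <- cbind(c(-1, -1, 1, 1), c(-1, 1, -1, 1))
det(crossprod(X_near))     # 0.16: close to singular
solve(crossprod(X_near))   # entries around 25: inflated variances
det(crossprod(X_orth))     # 16
solve(crossprod(X_orth))   # diagonal 0.25: well behaved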

In these simple cases we can see exactly where the problems are. With more variables it is not always obvious that some columns of the X matrix are close to being linear combinations of others. This problem is sometimes called multicollinearity. These examples illustrate the general problems caused by multicollinearity:

(i) some or all parameter estimators will have large variances;

(ii) difficulties may arise in variable selection as it will be possible to get very different models that fit equally well;

(iii) some parameter estimates may have the “wrong” sign; this can occur when it is obvious that increasing the value of a regressor should result in an increase in the dependent variable.

3.10.2 Variance inflation factor

The variance inflation factor (VIF) can be used to indicate when multicollinearity may be a problem.

Consider a regression problem with p − 1 regressors. Suppose we fitted a regression model with $x_j$ as the dependent variable and the remaining p − 2 variables as the regressor variables. Let $R^2_j$ be the coefficient of determination (not expressed as a percentage) for this model. Then we define the jth variance inflation factor as
$$VIF_j = \frac{1}{1 - R^2_j}.$$

A large value of $R^2_j$ (close to 1) will give a large $VIF_j$. In this context a $VIF > 10$ is taken to indicate that the multicollinearity may cause problems of the sort noted above.
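The definition can be applied directly. The sketch below (not from the notes) computes $VIF_1$ by hand for the body-fat example that follows, assuming its variables x1, x2, x3 are already in the workspace; the result should match the value reported by car::vif().

# VIF for x1 by hand: regress x1 on the other regressors and use R^2_1.
r2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2_1)            # compare with vif(modf)["x1"] below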


As an example we look at the relationship between body fat (Y) and three body measurements: triceps skinfold thickness (X1), thigh circumference (X2) and midarm circumference (X3). We have data from 20 individuals.

The R output is given below.

> modf <- lm(y ~ x1 + x2 + x3)
> summary(modf)

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7263 -1.6111  0.3923  1.4656  4.1277 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  117.085     99.782   1.173    0.258
x1             4.334      3.016   1.437    0.170
x2            -2.857      2.582  -1.106    0.285
x3            -2.186      1.595  -1.370    0.190

Residual standard error: 2.48 on 16 degrees of freedom
Multiple R-squared: 0.8014,    Adjusted R-squared: 0.7641
F-statistic: 21.52 on 3 and 16 DF,  p-value: 7.343e-06

> library(car)
Warning message:
package ‘car’ was built under R version 3.4.3
> vif(modf)
      x1       x2       x3 
708.8429 564.3434 104.6060 

> modr <- lm(y ~ x1 + x3)
> summary(modr)

Call:
lm(formula = y ~ x1 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8794 -1.9627  0.3811  1.2688  3.8942 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.7916     4.4883   1.513   0.1486    
x1            1.0006     0.1282   7.803 5.12e-07 ***
x3           -0.4314     0.1766  -2.443   0.0258 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.496 on 17 degrees of freedom
Multiple R-squared: 0.7862,    Adjusted R-squared: 0.761
F-statistic: 31.25 on 2 and 17 DF,  p-value: 2.022e-06

> vif(modr)
      x1       x3 
1.265118 1.265118 

> modr1 <- lm(y ~ x2 + x3)
> summary(modr1)

Call:
lm(formula = y ~ x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0777 -1.8296  0.1893  1.3545  4.1275 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -25.99695    6.99732  -3.715  0.00172 ** 
x2            0.85088    0.11245   7.567 7.72e-07 ***
x3            0.09603    0.16139   0.595  0.55968    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.557 on 17 degrees of freedom
Multiple R-squared: 0.7757,    Adjusted R-squared: 0.7493
F-statistic: 29.4 on 2 and 17 DF,  p-value: 3.033e-06

> vif(modr1)
     x2      x3 
1.00722 1.00722 

> mody <- lm(y ~ x2)
> summary(mody)


Call:
lm(formula = y ~ x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4949 -1.5671  0.1241  1.3362  4.4084 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -23.6345     5.6574  -4.178 0.000566 ***
x2            0.8565     0.1100   7.786  3.6e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.51 on 18 degrees of freedom
Multiple R-squared: 0.771,    Adjusted R-squared: 0.7583
F-statistic: 60.62 on 1 and 18 DF,  p-value: 3.6e-07

> mody1 <- lm(y ~ x1)
> summary(mody1)

Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.1195 -2.1904  0.6735  1.9383  3.8523 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.4961     3.3192  -0.451    0.658    
x1            0.8572     0.1288   6.656 3.02e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.82 on 18 degrees of freedom
Multiple R-squared: 0.7111,    Adjusted R-squared: 0.695
F-statistic: 44.3 on 1 and 18 DF,  p-value: 3.024e-06

The overall regression is highly significant (F = 21.52, P < 0.0005), but none of the variables is significant even at a significance level of α = 0.10. This is indicative of a problem with multicollinearity: the variances of the parameter estimates are inflated.

The variance inflation factors are 708.843, 564.343 and 104.606. Any factor larger than 10 is indicative of a problem with multicollinearity, so there are certainly problems here.

The matrix plot is shown below.


It is clear that X1 and X2 are highly correlated. The correlation is 0.924.

For the model with X2 alone the residual plots show satisfactory behaviour and no reason to doubt the usual model assumptions.

Similarly for the model X1 +X3 the plots are satisfactory.

The statistics for both models are fairly close and for reasons of parsimony I would tend to support the simpler one-variable model with just X2.

3.11 Model checking

3.11.1 Standardised residuals

We defined the vector of residuals $e = Y - \hat{Y} = (I-H)Y$ and showed that $E(e) = 0$ and $\mathrm{Var}(e) = \sigma^2(I-H)$, where H is the hat matrix, whose elements we may write as $h_{ij}$. Then $\mathrm{Var}(e_i) = (1-h_{ii})\sigma^2$ and $\mathrm{cov}(e_i, e_j) = -h_{ij}\sigma^2$.

Note that in Chapter 2 we used the notation $v_i$ for $h_{ii}$; either is acceptable.

We see that the residuals may have different variances. This may make detecting outlying observations more difficult. So it is useful to standardise the residuals. We define
$$d_i = \frac{e_i}{\sqrt{S^2(1 - h_{ii})}}$$
to be the standardised (or studentised) residual. If the model is appropriate we have (assuming normality) that $d_i \sim t_{n-p}$. Also, for large samples $h_{ij}$ will be small for $i \neq j$ and we have an asymptotic result that the standardised residuals are independent and identically distributed as N(0, 1). This allows us to carry out model checking. Note however that it will be most reliable for large samples.

We use the standardised residuals for model checking as follows:

(i) To check the form of the linear predictor
$$E(Y_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i}$$
plot $d_i$ versus $x_{ji}$ for each j = 1, ..., p − 1. Any curvature suggests that a higher order term in $x_j$ is needed.

(ii) To check the assumption of constant variance, plot $d_i$ against the fitted values $\hat{y}_i$. Any pattern of variance, especially a funnel shape, should show up. In this case a transformation may be needed. Other sorts of patterns in this plot may be indicative of a lack of fit.

(iii) To check the assumption of uncorrelated errors, plot $d_i$ versus time if known. Any autocorrelation should be detected.

(iv) To check the assumption of normality, a normal plot of the ordered standardised residuals should be made. Departure from a straight line may indicate that a transformation is needed.

As with simple linear regression outliers may be evident from the residual plots.

3.11.2 Leverage and influence

We noted in section 2.7.4 that an observation with high leverage was potentially influential. We discuss this in greater detail here. Note that $v_i$ of that section is $h_{ii}$ here.

The fitted model is
$$\hat{Y} = \hat\mu = X\hat\beta = X(X^TX)^{-1}X^TY = HY$$
and the ith fitted value can be written as
$$\hat\mu_i = \sum_{j=1}^{n} h_{ij}y_j = h_{ii}y_i + \sum_{j \neq i} h_{ij}y_j.$$

The weight $h_{ii}$ indicates how heavily $y_i$ contributes to the fitted value $\hat\mu_i$. The quantity $h_{ii}$ is called the leverage of case i. The ith diagonal element of the hat matrix, $h_{ii}$, has the following properties:

1. $\mathrm{Var}(e_i) = \sigma^2(1-h_{ii})$, hence $h_{ii} < 1$; this means that an $h_{ii}$ close to 1 will give $\mathrm{Var}(e_i) \approx 0$ and so $\hat\mu_i \approx y_i$, that is, the fitted value is very close to the ith observation.

2. $h_{ii}$ is usually small when $x_i$ is close to the centroid $\bar{x} = (\bar{x}_1, \ldots, \bar{x}_{p-1})^T$ and it is large when $x_i$ is far from $\bar{x}$.

3. When p = 2 (SLR)
$$h_{ii} = v_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$
and $h_{ii} = \frac{1}{n}$ when $x_i = \bar{x}$.

4. In general, $\frac{1}{n} \le h_{ii} < 1$.


5. $\sum_{i=1}^{n} h_{ii} = p$ since
$$\sum_{i=1}^{n} h_{ii} = \mathrm{trace}(H) = \mathrm{trace}(X(X^TX)^{-1}X^T) = \mathrm{trace}(X^TX(X^TX)^{-1}) = \mathrm{trace}(I_p) = p.$$

Hence the average leverage is $\frac{p}{n}$. A case for which $h_{ii} > \frac{2p}{n}$ is considered a high leverage case and one with $h_{ii} > \frac{3p}{n}$ is considered a very high leverage case.

There may be various reasons for high leverage. It may be that the data of the case were collected differently than the others, or simply mis-recorded. It may just be that the case has one or more values which are atypical but correctly recorded. A low-leverage case will not influence the fit much; a high-leverage case indicates potential influence, but not all high-leverage cases are influential.

Example 3.1. Suppose we consider a simple linear regression model with $x_i = i$ for $i = 1, 2, \ldots, 10$.

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & 10 \end{bmatrix}, \qquad X^TX = \begin{bmatrix} 10 & 55 \\ 55 & 385 \end{bmatrix}, \qquad (X^TX)^{-1} = \frac{1}{825}\begin{bmatrix} 385 & -55 \\ -55 & 10 \end{bmatrix}$$

$$H = X(X^TX)^{-1}X^T = \frac{1}{825}\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{10} \end{bmatrix}\begin{bmatrix} 385 & -55 \\ -55 & 10 \end{bmatrix}\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}$$

$$h_{ii} = (385 - 110x_i + 10x_i^2)/825.$$

So for i = 1, ..., 10 the values of $h_{ii}$ are 0.346, 0.248, 0.176, 0.127, 0.103, 0.103, 0.127, ..., 0.346. Now suppose we move point $x_{10}$ from 10 to 20.

$$X^TX = \begin{bmatrix} 10 & 65 \\ 65 & 685 \end{bmatrix}, \qquad (X^TX)^{-1} = \frac{1}{2625}\begin{bmatrix} 685 & -65 \\ -65 & 10 \end{bmatrix}, \qquad h_{ii} = (685 - 130x_i + 10x_i^2)/2625.$$

So for i = 1, ..., 10 the values of $h_{ii}$ are now 0.215, 0.177, 0.147, 0.124, 0.109, 0.101, 0.101, 0.109, 0.124, 0.794. The value of $h_{10,10}$ is 0.794 > 2 × 2/10 = 0.4, so observation 10 has high leverage.
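In R the leverages can be obtained with hatvalues(); the following sketch (not in the notes) reproduces the second set of values, using an arbitrary response since the leverages depend only on the x values.

# Leverages for Example 3.1 with the tenth point moved to 20.
x <- c(1:9, 20)
y <- rnorm(10)                 # any response will do: h_ii depends only on X
h <- hatvalues(lm(y ~ x))
round(h, 3)                    # h[10] is about 0.794
sum(h)                         # equals p = 2 (property 5)
which(h > 2 * 2 / 10)          # cases flagged by the 2p/n rule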


3.11.3 Cook’s distance

Recall from section 2.7.4 that Cook's distance provides a measure of the influence of an observation on the fitted model. A further definition of this for the general linear model is given by
$$D_i = \frac{(\hat\beta - \hat\beta_{(i)})^T (X^TX) (\hat\beta - \hat\beta_{(i)})}{p s^2}.$$

This can be thought of as a scaled distance between $\hat\beta$ and $\hat\beta_{(i)}$, where $\hat\beta_{(i)}$ is the estimate of $\beta$ obtained omitting the ith observation.

Otherwise there is nothing to add to the earlier discussion.
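For completeness, a short sketch (not in the notes) of how these case-influence diagnostics are obtained in R; the data here are made up purely for illustration.

# Case influence diagnostics for an lm fit (illustrative data).
set.seed(42)
x <- c(1:9, 20)                       # the high-leverage design of Example 3.1
y <- 2 + 0.5 * x + rnorm(10)
fit <- lm(y ~ x)
cooks.distance(fit)                   # D_i for each case
hatvalues(fit)                        # leverages h_ii
influence.measures(fit)               # summary table; potentially influential cases flagged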

3.12 What is a linear model?

A linear model is linear in the parameters. The regressors can be non-linear. A model which is not linear in the parameters can sometimes be linearised by transforming the dependent variable.

Consider the following examples:

(a) $Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 \cos x_{2i} + \beta_3 \sqrt{x_{3i}} + \varepsilon_i$

This is a linear model in the parameters.

(b) $Y_i = \varepsilon_i \exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}^{-1})$

This is not a linear model in the parameters, but it can be linearised by the natural logarithm transformation. The transformed model (fitted in the sketch after this list) is:
$$\ln(Y_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}^{-1} + \ln(\varepsilon_i).$$

(c) $Y_i = \beta_0 + \exp(\beta_1 x_{1i}) + \varepsilon_i$

This model is not linear in $\beta_1$ and it cannot be linearised by a transformation of the response.

(d) $Y_i = 1 + \exp(\beta_0 + \beta_1 x_{1i} + \varepsilon_i)$

This is not a linear model in the parameters, but it can be linearised by subtracting the constant 1 from both sides of the equation and taking the natural logarithm to get:
$$\ln(Y_i - 1) = \beta_0 + \beta_1 x_{1i} + \varepsilon_i.$$

(e) $Y_i = (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i)^{-1}$

This is not a linear model in the parameters, but it can be linearised by inverting the response, assuming that the response cannot take the value zero. The transformed model is
$$\frac{1}{Y_i} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i.$$
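In R, linearised models such as (b) or (e) can be fitted with lm() and transformed variables; the sketch below (not in the notes, using made-up data) fits model (b) on the log scale.

# Fitting model (b) after taking logs: log(Y) = b0 + b1*x1 + b2*(1/x2) + log(eps).
set.seed(1)
x1 <- runif(30, 1, 5)
x2 <- runif(30, 1, 5)
y  <- exp(0.5 + 1.2 * x1 + 2 / x2) * exp(rnorm(30, sd = 0.1))   # multiplicative error
fit_b <- lm(log(y) ~ x1 + I(1/x2))
summary(fit_b)$coefficients    # estimates of beta_0, beta_1, beta_2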


3.13 Polynomial regression

Another useful class of linear models is the class of polynomial regression models, e.g.
$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i,$$
the quadratic regression model.

This can be written as
$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I),$$
where the rows of the matrix X are of the form $(1, x_i, x_i^2)$ and $\beta = (\beta_0, \beta_1, \beta_{11})^T$.

The quadratic model belongs to the class of linear models as it is linear in the parameters.

If we wish to compare the quadratic regression model with the simple linear regression model, we fit
$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i$$
and test the null hypothesis $H_0: \beta_{11} = 0$ versus $H_1: \beta_{11} \neq 0$.

If we reject $H_0$, the quadratic model gives a significantly better fit than the simple linear model.

This can be extended to cubic and higher order polynomials.

Higher powers of x quickly become large and it is usually sensible to centre x by subtracting its mean.

Denote
$$z_i = x_i - \bar{x}.$$
Then, for some parameters γ, we can write
$$\begin{aligned}
E(Y_i) &= \gamma_0 + \gamma_1 z_i + \gamma_{11} z_i^2 \\
&= \gamma_0 + \gamma_1(x_i - \bar{x}) + \gamma_{11}(x_i - \bar{x})^2 \\
&= (\gamma_0 - \gamma_1\bar{x} + \gamma_{11}\bar{x}^2) + (\gamma_1 - 2\gamma_{11}\bar{x})x_i + \gamma_{11}x_i^2 \\
&= \beta_0 + \beta_1 x_i + \beta_{11} x_i^2,
\end{aligned}$$
where $\beta_0 = \gamma_0 - \gamma_1\bar{x} + \gamma_{11}\bar{x}^2$, $\beta_1 = \gamma_1 - 2\gamma_{11}\bar{x}$ and $\beta_{11} = \gamma_{11}$.

We can also have a second (or higher) order polynomial regression model in two (or more) explanatory variables.

For example,

$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_{11} x_{1i}^2 + \beta_{22} x_{2i}^2 + \beta_{12} x_{1i} x_{2i} + \varepsilon_i.$$

This model is very commonly used in experiments for exploring response surfaces.

Note that if the second order terms $x_{1i}^2$, $x_{2i}^2$ and $x_{1i}x_{2i}$ are in the model then we should not consider removing the first order terms $x_{1i}$ and $x_{2i}$.


3.13.1 Example: Crop Yield

An agronomist studied the effects of moisture (X1 in inches) and temperature (X2 in °C) on the yield (Y) of a new hybrid tomato. He recorded values of 25 random samples of yield obtained for various levels of moisture and temperature. The data are below.

 y    x1 x2    y    x1 x2    y    x1 x2    y    x1 x2    y    x1 x2
49.2   6 20   51.5   8 20   51.1  10 20   48.6  12 20   43.2  14 20
48.1   6 21   51.7   8 21   51.5  10 21   47.0  12 21   42.6  14 21
48.0   6 22   50.4   8 22   50.3  10 22   48.0  12 22   42.1  14 22
49.6   6 23   51.2   8 23   48.9  10 23   46.4  12 23   43.9  14 23
47.0   6 24   48.4   8 24   48.7  10 24   46.2  12 24   40.5  14 24

Looking at the data we see that

• For moisture, Y clearly shows some curvature.

• There is a slight decrease in Y with increasing temperature on average, but there is also large variability in the data.

• There is no dependence between the two explanatory variables.

Below is the R output for fitting the model
$$Y_i = \gamma_0 + \gamma_1 z_{1i} + \gamma_2 z_{2i} + \gamma_{11} z_{1i}^2 + \gamma_{22} z_{2i}^2 + \gamma_{12} z_{1i} z_{2i} + \varepsilon_i,$$
where $z_1 = x_1 - \bar{x}_1$, $z_2 = x_2 - \bar{x}_2$ and $\varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$.

> tomato <- read.csv("tomato.csv")
> y  <- tomato[,1]
> z1 <- tomato[,2]
> z2 <- tomato[,3]
> z1sq <- z1^2
> z2sq <- z2^2
> z12 <- z1*z2
> mody <- lm(y ~ z1 + z2 + z1sq + z2sq + z12)
> summary(mody)

Call:
lm(formula = y ~ z1 + z2 + z1sq + z2sq + z12)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.09100 -0.55029 -0.06971  0.31143  1.94029 

Coefficients:


            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 50.38400    0.33488 150.454  < 2e-16 ***
z1          -0.76200    0.06028 -12.640 1.07e-10 ***
z2          -0.53000    0.12057  -4.396 0.000311 ***
z1sq        -0.29286    0.02548 -11.496 5.33e-10 ***
z2sq        -0.13857    0.10190  -1.360 0.189792    
z12         -0.00550    0.04263  -0.129 0.898696    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8526 on 19 degrees of freedom
Multiple R-squared: 0.9428,    Adjusted R-squared: 0.9277
F-statistic: 62.62 on 5 and 19 DF,  p-value: 3.842e-11

> anova(mody)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq  F value    Pr(>F)    
z1         1 116.129 116.129 159.7669 1.072e-10 ***
z2         1  14.045  14.045  19.3227 0.0003107 ***
z1sq       1  96.057  96.057 132.1529 5.331e-10 ***
z2sq       1   1.344   1.344   1.8492 0.1897916    
z12        1   0.012   0.012   0.0166 0.8986957    
Residuals 19  13.810   0.727
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

• We can see that $\gamma_{22}$ is not significantly different from zero when all other parameters are in the model; the p-value is bigger than 0.1 (p = 0.19).

• Similarly for $\gamma_{12}$; for this parameter we have a very large p-value (p = 0.899).

The overall regression is highly significant (p < 0.001).

We will test the hypothesis

$$H_0: \gamma_{22} = \gamma_{12} = 0 \quad \text{vs} \quad H_1: \neg H_0.$$
Here we have n = 25, p = 6, q = 4,
$$SS_{extra} = 1.344 + 0.012 = 1.356, \qquad MS_E^{full} = 0.727,$$
$$F = \frac{SS_{extra}/(p-q)}{MS_E^{full}} \qquad \text{and} \qquad F_{cal} = \frac{1.356/2}{0.727} = 0.9326.$$
This gives p = 0.4108. Hence, we do not have evidence to reject the null hypothesis that both parameters are zero.
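The same p-value can be obtained directly in R; the sketch below is not part of the original output and assumes y, z1, z2, z1sq and mody are still in the workspace as above.

# p-value for F = 0.9326 on (p - q, n - p) = (2, 19) degrees of freedom
pf(0.9326, df1 = 2, df2 = 19, lower.tail = FALSE)   # about 0.41
# Equivalently, compare the reduced and full fits directly:
anova(lm(y ~ z1 + z2 + z1sq), mody)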

A new model fit: without the terms $z_2^2$ and $z_1 z_2$.


> mody2 <- lm(y ~ z1 + z2 + z1sq)
> summary(mody2)

Call:
lm(formula = y ~ z1 + z2 + z1sq)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9994 -0.4691 -0.2331  0.1931  2.0569 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 50.10686    0.26487  189.17  < 2e-16 ***
z1          -0.76200    0.06009  -12.68 2.62e-11 ***
z2          -0.53000    0.12019   -4.41 0.000244 ***
z1sq        -0.29286    0.02539  -11.53 1.51e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8498 on 21 degrees of freedom
Multiple R-squared: 0.9372,    Adjusted R-squared: 0.9282
F-statistic: 104.4 on 3 and 21 DF,  p-value: 8.829e-13

> stdres1 <- rstandard(mody2)
> shapiro.test(stdres1)

	Shapiro-Wilk normality test

data:  stdres1
W = 0.89169, p-value = 0.01215

• Here we see that the residuals are not normal (the p-value is 0.012 for the test of normality).

• Hence, some further analysis is needed. Various transformations of the response variable did not work here.

• The residuals improve when z2 is removed. The new model fit is below.

> mody3 <- lm(y ~ z1 + z1sq)
> summary(mody3)

Call:
lm(formula = y ~ z1 + z1sq)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0594 -1.0114  0.1931  0.9931  1.5269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 50.10686    0.35915 139.516  < 2e-16 ***
z1          -0.76200    0.08148  -9.352 4.03e-09 ***
z1sq        -0.29286    0.03443  -8.505 2.10e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.152 on 22 degrees of freedom
Multiple R-squared: 0.879,    Adjusted R-squared: 0.868
F-statistic: 79.9 on 2 and 22 DF,  p-value: 8.148e-11

> stdres2 <- rstandard(mody3)
> shapiro.test(stdres2)

	Shapiro-Wilk normality test

data:  stdres2
W = 0.93343, p-value = 0.1043

The residuals are better here and do not clearly contradict the assumption of normality.

• The model is parsimonious, so we may stay with this one.

• However, we could advise the agronomist that in future experiments of this kind he might consider a wider range of temperature values, which would help to establish clearly whether this factor could be significant for yield of the new hybrid tomatoes.

4 Theory of Linear Models

In this final chapter we prove some results which justify some of the methods we have used.

4.1 The Gauss-Markov Theorem

Theorem 4.1. Given the linear model

Y = Xβ + ε,

where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$, the least squares estimator $\hat\beta = (X^TX)^{-1}X^TY$ is such that $l^T\hat\beta$ is the minimum variance linear unbiased estimator of the estimable function $l^T\beta$ of the parameters $\beta$.


Note: We call such an estimator BLUE, for Best Linear Unbiased Estimator. It is the estimator that, among all unbiased estimators of the form $c^TY$, has the smallest variance.

Proof. $l^T\hat\beta$ is a linear combination of the observations Y,
$$l^T\hat\beta = l^T(X^TX)^{-1}X^TY.$$
Let $c^TY$ be another linear unbiased estimator of $l^T\beta$. Then
$$E(c^TY) = c^T E(Y) = c^T X\beta = l^T\beta,$$
that is, for $l^T = c^TX$, $c^TY$ is unbiased. Now,
$$\mathrm{Var}(c^TY) = c^T\mathrm{Var}(Y)c = \sigma^2 c^T I c = \sigma^2 c^T c.$$
Also,
$$\mathrm{Var}(l^T\hat\beta) = l^T\mathrm{Var}(\hat\beta)l = \sigma^2 l^T(X^TX)^{-1}l = \sigma^2 c^TX(X^TX)^{-1}X^Tc = \sigma^2 c^THc.$$
Then
$$\mathrm{Var}(c^TY) - \mathrm{Var}(l^T\hat\beta) = \sigma^2(c^Tc - c^THc) = \sigma^2 c^T(I-H)c = \sigma^2 \underbrace{c^T(I-H)^T}_{=z^T}\underbrace{(I-H)c}_{=z} = \sigma^2 z^Tz \ge 0,$$
using the fact that $I - H$ is symmetric and idempotent. Hence $\mathrm{Var}(c^TY) \ge \mathrm{Var}(l^T\hat\beta)$ and so $l^T\hat\beta$ is the BLUE of $l^T\beta$. $\square$

4.2 Sampling distribution of MSE (S2)

First we will show that in the linear model

Y = Xβ + ε, ε ∼ N(0, σ2I),

we have $E(S^2) = \sigma^2$.

For this we need some results on matrix algebra.

Lemma 4.1. For an idempotent matrix A of rank r, there exists an orthogonal matrix C ($C^TC = I$) such that
$$A = CDC^T,$$
where
$$D = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}.$$


Lemma 4.2. Properties of trace

1. For any matrices A and D of appropriate dimensions and for a scalar k we have

(a) trace(AD) = trace(DA)

(b) trace(A+D) = trace(A) + trace(D)

(c) trace(kA) = k trace(A)

2. For an idempotent matrix A

trace(A) = rank(A)

Proof. 1. A simple consequence of the definition of trace.

2. If A is idempotent then by Lemma 4.1, $A = CDC^T$ for an orthogonal C. Then
$$\mathrm{trace}(A) = \mathrm{trace}(CDC^T) = \mathrm{trace}(C^TCD) = \mathrm{trace}(D) = r.$$

Lemma 4.3. $\mathrm{rank}(I - H) = n - p$.

Proof.

$$\begin{aligned}
\mathrm{rank}(I - H) &= \mathrm{trace}(I - H) \\
&= \mathrm{trace}(I) - \mathrm{trace}(H) \\
&= n - \mathrm{trace}\{X(X^TX)^{-1}X^T\} \\
&= n - \mathrm{trace}\{X^TX(X^TX)^{-1}\} \\
&= n - \mathrm{trace}(I_p) \\
&= n - p.
\end{aligned}$$

Lemma 4.4. Let z be a random vector such that

$$E(z) = \mu, \qquad \mathrm{Var}(z) = V,$$
then
$$E(z^TAz) = \mathrm{trace}(AV) + \mu^TA\mu$$
for any matrix A.


Proof. We have
$$V = E(zz^T) - E(z)E(z^T) = E(zz^T) - \mu\mu^T.$$
Hence
$$E(zz^T) = V + \mu\mu^T.$$
Then
$$\begin{aligned}
E(z^TAz) &= E[\mathrm{trace}(z^TAz)] \\
&= E[\mathrm{trace}(zz^TA)] \\
&= \mathrm{trace}[A\,E(zz^T)] \\
&= \mathrm{trace}[A(V + \mu\mu^T)] \\
&= \mathrm{trace}(AV) + \mathrm{trace}(\mu^TA\mu) \\
&= \mathrm{trace}(AV) + \mu^TA\mu.
\end{aligned}$$

Theorem 4.2. Let $Y = X\beta + \varepsilon$ be a linear model such that $E(Y) = X\beta$ and $\mathrm{Var}(Y) = \sigma^2 I_n$. Then the error sum of squares, $SS_E$, has expectation equal to

E(SSE) = (n− p)σ2,

where p is the number of parameters in the model.

Proof.

$$SS_E = Y^T(I - H)Y$$
$$\begin{aligned}
E(SS_E) &= E[Y^T(I-H)Y] \\
&= \mathrm{trace}\{(I-H)\mathrm{Var}(Y)\} + E(Y)^T(I-H)E(Y) \\
&= \sigma^2\,\mathrm{trace}(I-H) + \beta^TX^T(I - X(X^TX)^{-1}X^T)X\beta \\
&= \sigma^2(n-p) + \beta^T(X^TX - X^TX\underbrace{(X^TX)^{-1}X^TX}_{=I})\beta \\
&= \sigma^2(n-p).
\end{aligned}$$

Corollary 4.1. $E(MS_E) = \sigma^2$.

To show that
$$\frac{(n-p)S^2}{\sigma^2} \sim \chi^2_{n-p},$$
the result we have used for deriving F tests, we will need the following results.

Lemma 4.5. If $Z_i$, i = 1, ..., r, are independent and identically distributed, each with a standard normal distribution, then
$$\sum_{i=1}^{r} Z_i^2 \sim \chi^2_r.$$

Lemma 4.6. The vector of residuals can be written as

e = (I −H)ε.


Proof.

$$\begin{aligned}
e &= Y - \hat{Y} \\
&= Y - HY \\
&= (I-H)Y \\
&= (I-H)(X\beta + \varepsilon) \\
&= X\beta - HX\beta + (I-H)\varepsilon \\
&= (I-H)\varepsilon,
\end{aligned}$$
since $HX = X(X^TX)^{-1}X^TX = X$ and so $X\beta - HX\beta = 0$.

Corollary 4.2. $e \sim N_n(0, \sigma^2(I - H))$.

Theorem 4.3.
$$\frac{(n-p)S^2}{\sigma^2} \sim \chi^2_{n-p}.$$

Proof.
$$\begin{aligned}
\frac{(n-p)S^2}{\sigma^2} &= \frac{1}{\sigma^2}SS_E \\
&= \frac{1}{\sigma^2}e^Te \\
&= \frac{1}{\sigma^2}\varepsilon^T(I-H)^T(I-H)\varepsilon \\
&= \frac{1}{\sigma^2}\varepsilon^T(I-H)\varepsilon \\
&= \frac{1}{\sigma^2}\varepsilon^TCDC^T\varepsilon \\
&= z^TDz,
\end{aligned}$$
where, by Lemma 4.1, as $I - H$ is idempotent of rank n − p, $I - H = CDC^T$ with C orthogonal,
$$D = \begin{bmatrix} I_{n-p} & 0 \\ 0 & 0 \end{bmatrix} \qquad \text{and} \qquad z = \frac{1}{\sigma}C^T\varepsilon.$$

We assume that $\varepsilon \sim N_n(0, \sigma^2 I)$. Hence z is also normal, with $E(z) = 0$ and
$$\mathrm{Var}(z) = \frac{1}{\sigma^2}C^T\mathrm{Var}(\varepsilon)C = \frac{\sigma^2}{\sigma^2}C^TC = C^TC = I$$
as C is orthogonal. Hence
$$z \sim N(0, I)$$


and so zi are independent and each distributed as N(0, 1).

Also
$$z^TDz = \sum_{i=1}^{n-p} z_i^2.$$
Hence by Lemma 4.5 we have
$$\frac{(n-p)S^2}{\sigma^2} = \sum_{i=1}^{n-p} z_i^2 \sim \chi^2_{n-p}.$$
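A small simulation (not part of the notes) can be used to check this distributional result numerically; the design matrix and parameter values below are arbitrary.

# Simulation check of Theorem 4.3: (n - p) S^2 / sigma^2 should be chi-squared(n - p).
set.seed(1)
n <- 30; p <- 3; sigma <- 2
X <- cbind(1, rnorm(n), rnorm(n))          # arbitrary design with p = 3 columns
beta <- c(1, 2, -1)
stat <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  fit <- lm(y ~ X - 1)                     # X already contains the intercept column
  (n - p) * summary(fit)$sigma^2 / sigma^2
})
c(mean = mean(stat), variance = var(stat)) # approx n - p = 27 and 2(n - p) = 54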

Further we will show the result given in section 3.3 that

$$F = \frac{MS_R}{MS_E} \sim F^{p-1}_{n-p}.$$

In that section we showed that

$$SS_R = Y^THY - n\bar{y}^2.$$

This can be written as

$$\begin{aligned}
SS_R &= Y^TX(X^TX)^{-1}X^TY - n\bar{y}^2 \\
&= \hat\beta^TX^TY - n\bar{y}^2 \\
&= (\hat\beta_0 \;\; \hat\beta_*^T)\begin{pmatrix} \mathbf{1}^T \\ X_*^T \end{pmatrix}Y - n\bar{y}^2,
\end{aligned}$$
where
$$X_*^T = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{p-1,1} & \cdots & x_{p-1,n} \end{bmatrix}, \qquad \hat\beta_* = \begin{bmatrix} \hat\beta_1 \\ \vdots \\ \hat\beta_{p-1} \end{bmatrix}.$$

This gives

$$\begin{aligned}
SS_R &= (\hat\beta_0 \;\; \hat\beta_*^T)\begin{pmatrix} \mathbf{1}^TY \\ X_*^TY \end{pmatrix} - n\bar{y}^2 \\
&= \hat\beta_0\mathbf{1}^TY + \hat\beta_*^TX_*^TY - n\bar{y}^2 \\
&= \hat\beta_0 n\bar{y} - n\bar{y}^2 + \hat\beta_*^TX_*^TY.
\end{aligned}$$
Now
$$\hat\beta_0 = \bar{y} - (\hat\beta_1\bar{x}_1 + \cdots + \hat\beta_{p-1}\bar{x}_{p-1}) = \bar{y} - \hat\beta_*^T\bar{x}.$$


Hence

$$\begin{aligned}
SS_R &= n\bar{y}^2 - n\hat\beta_*^T\bar{x}\bar{y} - n\bar{y}^2 + \hat\beta_*^TX_*^TY \\
&= \hat\beta_*^T\Big(X_*^T - \frac{n}{n}\bar{x}\mathbf{1}^T\Big)Y \qquad \text{as } \bar{y} = \tfrac{1}{n}\mathbf{1}^TY \\
&= \hat\beta_*^T(X_*^T - \bar{x}\mathbf{1}^T)Y \\
&= \hat\beta_*^T\underbrace{\left(\begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{p-1,1} & \cdots & x_{p-1,n} \end{bmatrix} - \begin{bmatrix} \bar{x}_1 & \cdots & \bar{x}_1 \\ \vdots & & \vdots \\ \bar{x}_{p-1} & \cdots & \bar{x}_{p-1} \end{bmatrix}\right)}_{=X_{*c}^T}Y \\
&= \hat\beta_*^TX_{*c}^TY.
\end{aligned}$$

Now, in a centred model, i.e. one with centred explanatory variables $x_{ij} - \bar{x}_i$, we have $\hat\beta_0 = \bar{y}$ and all other $\hat\beta_i$ are as in the non-centred model, so
$$SS_R = \hat\beta_*^TX_{*c}^TY.$$

Theorem 4.4. Under H0 : β∗ = 0 we have

$$\frac{SS_R}{\sigma^2} \sim \chi^2_{p-1}.$$

Proof. $E(\hat\beta_*) = \beta_* = 0$ under $H_0$,
$$\mathrm{Var}(\hat\beta_*) = \sigma^2(X_{*c}^TX_{*c})^{-1} \qquad \text{and} \qquad \hat\beta_* \sim N(0, \sigma^2(X_{*c}^TX_{*c})^{-1}).$$
Let C be a $(p-1)\times(p-1)$ dimensional matrix such that
$$C^TC = X_{*c}^TX_{*c}.$$
Now multiply on the left by $(C^T)^{-1}$, so that
$$C = (C^T)^{-1}X_{*c}^TX_{*c}.$$
Now multiply on the right by $(X_{*c}^TX_{*c})^{-1}$, so that
$$C(X_{*c}^TX_{*c})^{-1} = (C^T)^{-1}.$$
Now multiply on the right by $C^T$ to give
$$C(X_{*c}^TX_{*c})^{-1}C^T = I.$$
Hence
$$z = C\hat\beta_* \sim N_{p-1}(0, \sigma^2 I),$$
that is, $z_i \sim N(0, \sigma^2)$


and are independent.

Now

$$\begin{aligned}
SS_R &= \hat\beta_*^T\underbrace{X_{*c}^TY}_{=X_{*c}^TX_{*c}\hat\beta_*} \\
&= \hat\beta_*^TX_{*c}^TX_{*c}\hat\beta_* \\
&= \hat\beta_*^TC^TC\hat\beta_* \\
&= z^Tz \\
&= \sum_{i=1}^{p-1} z_i^2.
\end{aligned}$$
Hence
$$\frac{SS_R}{\sigma^2} \sim \chi^2_{p-1}.$$

Corollary 4.3.
$$\frac{(p-1)MS_R}{\sigma^2} \sim \chi^2_{p-1}.$$

From this, Theorem 4.3 and the independence of $MS_E$ and $MS_R$ we obtain

Theorem 4.5.
$$F = \frac{MS_R}{MS_E} \sim F^{p-1}_{n-p}.$$
