
Chapter 2: Simple Linear Regression

• 2.1 The Model
• 2.2-2.3 Parameter Estimation
• 2.4 Properties of Estimators
• 2.5 Inference
• 2.6 Prediction
• 2.7 Analysis of Variance
• 2.8 Regression Through the Origin
• 2.9 Related Models


2.1 The Model

• The measurement of y (the response) changes in a linear fashion with the setting of the variable x (the predictor):

  y = β0 + β1x + ε

  where β0 + β1x is the linear relation and ε is the noise.

◦ The linear relation is deterministic (non-random).

◦ The noise or error is random.

• Noise accounts for the variability of the observations about the straight line. No noise ⇒ the relation is deterministic. Increased noise ⇒ increased variability.


• Experiment with this simulation program:

simple.sim <- function(intercept = 0, slope = 1, x = seq(1, 10), sigma = 1) {
    noise <- rnorm(length(x), sd = sigma)        # random errors
    y <- intercept + slope*x + noise             # simulated responses
    title1 <- paste("sigma = ", sigma)
    plot(x, y, pch = 16, main = title1)
    abline(intercept, slope, col = 4, lwd = 2)   # overlay the true line
}

• > source("simple.sim.R")

> simple.sim(sigma=.01)

> simple.sim(sigma=.1)

> simple.sim(sigma=1)

> simple.sim(sigma=10)


Simulation Examples

[Figure: four simulated scatterplots of y against x (x = 1, …, 10) with the true line overlaid, for sigma = 0.01, 0.1, 1, and 10; the scatter about the line increases with sigma.]


The Setup

• Assumptions:
  1. E[y|x] = β0 + β1x.
  2. Var(y|x) = Var(β0 + β1x + ε|x) = σ².

• Data: Suppose data y1, y2, …, yn are obtained at settings x1, x2, …, xn, respectively. Then the model on the data is

yi = β0 + β1xi + εi

(εi i.i.d. N(0, σ²), so E[yi|xi] = β0 + β1xi.) Either

  1. the x's are fixed values, measured without error (controlled experiment), OR

  2. the analysis is conditional on the observed values of x (observational study).


2.2-2.3 Parameter Estimation, Fitted Values and Residuals

1. Maximum Likelihood Estimation

Distributional assumptions are required

2. Least Squares Estimation

Distributional assumptions are not required


2.2.1 Maximum Likelihood Estimation

◦ Normal assumption is required:

  f(yi|xi) = (1/(√(2π) σ)) exp(−(1/(2σ²))(yi − β0 − β1xi)²)

Likelihood:

  L(β0, β1, σ) = ∏ f(yi|xi) ∝ (1/σⁿ) exp(−(1/(2σ²)) ∑_{i=1}^n (yi − β0 − β1xi)²)

◦ Maximize with respect to β0, β1, and σ².

◦ (β0, β1): equivalent to minimizing ∑_{i=1}^n (yi − β0 − β1xi)², i.e. least squares.

◦ σ²: the MLE is SSE/n, which is biased.


2.2.2 Least Squares Estimation

◦ Assumptions:

1. E[εi] = 0

2. Var(εi) = σ²

3. εi’s are independent.

• Note that normality is not required.


Method

• Minimize

  S(β0, β1) = ∑_{i=1}^n (yi − β0 − β1xi)²

  with respect to the parameters (regression coefficients) β0 and β1; the minimizers are the least-squares estimators β̂0 and β̂1.

◦ Justification: We want the fitted line to pass as close to all of the points as possible.

◦ Aim: small residuals (observed − fitted response values):

  ei = yi − β̂0 − β̂1xi


Look at the following plots:

> source("roller2.plot")

> roller2.plot(a=14,b=0)

> roller2.plot(a=2,b=2)

> roller2.plot(a=12, b=1)

> roller2.plot(a=-2,b=2.67)


[Figure: four plots of depression in lawn (mm) against roller weight (t), each showing the data values, a candidate fitted line, and the positive and negative residuals, for (a, b) = (14, 0), (2, 2), (12, 1), and (−2, 2.67).]

◦ The first three lines above do not pass as close to the plotted points as the fourth, even though the sum of the residuals is about the same in all four cases.

◦ Negative residuals cancel out positive residuals.

The Key: minimize squared residuals

> source("roller3.plot.R")
> roller3.plot(14,0); roller3.plot(2,2); roller3.plot(12,1)
> roller3.plot(a=-2,b=2.67)   # small SS

[Figure: the corresponding four plots from roller3.plot for (a, b) = (14, 0), (2, 2), (12, 1), and (−2, 2.67); the fourth line gives the smallest sum of squared residuals.]


Unbiased estimate of σ²

  (1/(n − # parameters estimated)) ∑_{i=1}^n ei²  =  (1/(n − 2)) ∑_{i=1}^n ei²

* n observations ⇒ n degrees of freedom

* 2 degrees of freedom are required to estimate the parameters

* the residuals retain n − 2 degrees of freedom


Alternative Viewpoint

* y = (y1, y2, . . . , yn) is a vector in n-dimensional space.(n degrees of freedom)

* The fitted values yi = β0 + β1xi also form a vector inn-dimensional space:y = (y1, y2, . . . , yn)

(2 degrees of freedom)* Least-squares seeks to minimize the distance between y and y.* The distance between n-dimensional vectors u and v is the

square root ofn∑i=1

(ui − vi)2

* Thus, the squared distance between y and y isn∑i=1

(yi − yi)2 =n∑i=1

e2i


Regression Coefficient Estimators

• The minimizers of

  S(β0, β1) = ∑_{i=1}^n (yi − β0 − β1xi)²

  are

  β̂0 = ȳ − β̂1x̄   and   β̂1 = Sxy/Sxx,

  where

  Sxy = ∑_{i=1}^n (xi − x̄)(yi − ȳ)   and   Sxx = ∑_{i=1}^n (xi − x̄)².


Home-made R Estimators

> ls.est <- function(data)
{
    x <- data[,1]
    y <- data[,2]
    xbar <- mean(x)
    ybar <- mean(y)
    Sxy <- S(x,y,xbar,ybar)
    Sxx <- S(x,x,xbar,xbar)
    b <- Sxy/Sxx          # slope estimate
    a <- ybar - xbar*b    # intercept estimate
    list(a = a, b = b, data = data)
}

> S <- function(x,y,xbar,ybar)
{
    sum((x-xbar)*(y-ybar))   # corrected sum of cross-products
}


Calculator Formulas and Other Properties

  Sxy = ∑_{i=1}^n xiyi − (∑_{i=1}^n xi)(∑_{i=1}^n yi)/n

  Sxx = ∑_{i=1}^n xi² − (∑_{i=1}^n xi)²/n

• Linearity Property:

  Sxy = ∑_{i=1}^n yi(xi − x̄),

  so β̂1 is linear in the yi's.

• Gauss-Markov Theorem: Among linear unbiased estimators for β1 and β0, β̂1 and β̂0 are best (i.e. they have the smallest variance).

• Exercise: Find the expected value and variance of (y1 − ȳ)/(x1 − x̄). Is it unbiased for β1? Is there a linear unbiased estimator with smaller variance?
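A quick simulation sketch related to this exercise (not from the notes; the settings and variable names are illustrative). It compares the crude estimator (y1 − ȳ)/(x1 − x̄) with the least-squares slope, illustrating the Gauss-Markov theorem empirically:

# Compare Var[(y1 - ybar)/(x1 - xbar)] with Var of the least-squares slope by simulation.
set.seed(1)
x <- seq(1, 10); n <- length(x)
beta0 <- 0; beta1 <- 1; sigma <- 1
crude <- lsq <- numeric(5000)
for (i in 1:5000) {
    y <- beta0 + beta1*x + rnorm(n, sd = sigma)
    crude[i] <- (y[1] - mean(y))/(x[1] - mean(x))                       # estimator from the exercise
    lsq[i]   <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)   # least-squares slope
}
c(mean(crude), mean(lsq))   # both near beta1 = 1: both appear unbiased
c(var(crude), var(lsq))     # the least-squares slope has the smaller variance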


Residuals

  ε̂i = ei = yi − β̂0 − β̂1xi
         = yi − ŷi
         = yi − ȳ − β̂1(xi − x̄)

• In R:

> res
function (ls.object)
{
    a <- ls.object$a
    b <- ls.object$b
    x <- ls.object$data[,1]
    y <- ls.object$data[,2]
    resids <- y - a - b*x
    resids
}


2.3.1 Consequences of Least-Squares

1. ei = yi − ȳ − β̂1(xi − x̄)   (follows from the intercept formula)

2. ∑_{i=1}^n ei = 0   (follows from 1.)

3. ∑_{i=1}^n ŷi = ∑_{i=1}^n yi   (follows from 2.)

4. The regression line passes through the centroid (x̄, ȳ)   (follows from the intercept formula)

5. ∑ xiei = 0   (set the partial derivative of S(β0, β1) with respect to β1 to 0)

6. ∑ ŷiei = 0   (follows from 2. and 5.)


2.3.2 Estimation of σ²

• The residual sum of squares or error sum of squares is given by

  SSE = ∑_{i=1}^n ei² = Syy − β̂1Sxy

  Note: SSE = SS_Res and Syy = SST.

• An unbiased estimator for the error variance is

  σ̂² = MSE = SSE/(n − 2) = (Syy − β̂1Sxy)/(n − 2)

  Note: MSE = MS_Res.

• σ̂ = Residual Standard Error = √MSE


Example

• roller data

weight depression

1 1.9 2

2 3.1 1

3 3.3 5

4 4.8 5

5 5.3 20

6 6.1 20

7 6.4 23

8 7.6 10

9 9.8 30

10 12.4 25


Hand Calculation

• (y = depression, x = weight)

◦ ∑_{i=1}^{10} xi = 1.9 + 3.1 + ··· + 12.4 = 60.7

◦ x̄ = 60.7/10 = 6.07

◦ ∑_{i=1}^{10} yi = 2 + 1 + ··· + 25 = 141

◦ ȳ = 141/10 = 14.1

◦ ∑_{i=1}^{10} xi² = 1.9² + 3.1² + ··· + 12.4² = 461

◦ ∑_{i=1}^{10} yi² = 4 + 1 + 25 + ··· + 625 = 3009

◦ ∑_{i=1}^{10} xiyi = (1.9)(2) + ··· + (12.4)(25) = 1103

◦ Sxx = 461 − (60.7)²/10 = 92.6


◦ Sxy = 1103 − (60.7)(141)/10 = 247

◦ Syy = 3009 − (141)²/10 = 1021

◦ β̂1 = Sxy/Sxx = 247/92.6 = 2.67

◦ β̂0 = ȳ − β̂1x̄ = 14.1 − 2.67(6.07) = −2.11

◦ σ̂² = (1/(n − 2))(Syy − β̂1Sxy) = (1/8)(1021 − 2.67(247)) = 45.2 = MSE

• Example Summary: the fitted regression line relating depression (y) to weight (x) is

  ŷ = −2.11 + 2.67x

  The error variance is estimated as MSE = 45.2.

R commands (home-made version)

> roller.obj <- ls.est(roller)

> roller.obj[-3] # intercept and slope estimate

$a

[1] -2.09

$b

[1] 2.67

> res(roller.obj) # residuals

[1] -0.98 -5.18 -1.71 -5.71 7.95

[6] 5.82 8.02 -8.18 5.95 -5.98

> sum(res(roller.obj)^2)/8 # error variance (MSE)

[1] 45.4


R commands (built-in version)

> attach(roller)

> roller.lm <- lm(depression ~ weight)

> summary(roller.lm)

Call:

lm(formula = depression ~ weight)

Residuals:

Min 1Q Median 3Q Max

-8.180 -5.580 -1.346 5.920 8.020

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.0871 4.7543 -0.439 0.67227

weight 2.6667 0.7002 3.808 0.00518


Residual standard error: 6.735 on 8 degrees of freedom

Multiple R-squared: 0.6445, Adjusted R-squared: 0.6001

F-statistic: 14.5 on 1 and 8 DF, p-value: 0.005175

> detach(roller)

• or

> roller.lm <- lm(depression ~ weight, data = roller)

> summary(roller.lm)

Using Extractor Functions

• Partial Output:

> coef(roller.lm)

(Intercept) weight

-2.09 2.67

> summary(roller.lm)$sigma

[1] 6.74

• From the output,

◦ the slope estimate is β̂1 = 2.67.

◦ the intercept estimate is β̂0 = −2.09.

◦ the Residual standard error is the square root of the MSE: 6.74² = 45.4.


Other R commands

◦ fitted values: predict(roller.lm)

◦ residuals: resid(roller.lm)

◦ diagnostic plots: plot(roller.lm) (these include a plot of the residuals against the fitted values and a normal probability plot of the residuals)

◦ Also plot(roller); abline(roller.lm)

(this gives a plot of the data with the fitted line overlaid)


2.4: Properties of Least Squares Estimates

◦ E[β̂1] = β1

◦ E[β̂0] = β0

◦ Var(β̂1) = σ²/Sxx

◦ Var(β̂0) = σ²(1/n + x̄²/Sxx)


Standard Error Estimators

◦ The estimated variance of β̂1 is MSE/Sxx, so the standard error (s.e.) of β̂1 is estimated by √(MSE/Sxx).

  roller e.g.: MSE = 45.2, Sxx = 92.6

  ⇒ s.e. of β̂1 is √(45.2/92.6) = .699

◦ The estimated variance of β̂0 is MSE(1/n + x̄²/Sxx), so the standard error (s.e.) of β̂0 is estimated by √(MSE(1/n + x̄²/Sxx)).

  roller e.g.: MSE = 45.2, Sxx = 92.6, x̄ = 6.07, n = 10

  ⇒ s.e. of β̂0 is √(45.2(1/10 + 6.07²/92.6)) = 4.74
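These standard errors are easy to reproduce from the summary quantities. A minimal R sketch (assuming the roller data frame used above is available; the variable names are illustrative):

x <- roller$weight; y <- roller$depression
n <- length(x)
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x))*(y - mean(y)))
b1 <- Sxy/Sxx                           # slope estimate
b0 <- mean(y) - b1*mean(x)              # intercept estimate
MSE <- sum((y - b0 - b1*x)^2)/(n - 2)   # error variance estimate
sqrt(MSE/Sxx)                           # s.e. of the slope, about 0.70
sqrt(MSE*(1/n + mean(x)^2/Sxx))         # s.e. of the intercept, about 4.75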


Distributions of β̂1 and β̂0

◦ yi is N(β0 + β1xi, σ²), and

  β̂1 = ∑_{i=1}^n aiyi   (ai = (xi − x̄)/Sxx)

  ⇒ β̂1 is N(β1, σ²/Sxx).

  Also, SSE/σ² is χ²_{n−2} (independent of β̂1), so

  (β̂1 − β1)/√(MSE/Sxx) ∼ t_{n−2}

◦ β̂0 = ∑_{i=1}^n ciyi   (ci = 1/n − x̄(xi − x̄)/Sxx)

  ⇒ β̂0 is N(β0, σ²(1/n + x̄²/Sxx)) and

  (β̂0 − β0)/√(MSE(1/n + x̄²/Sxx)) ∼ t_{n−2}


2.5 Inferences about the Regression Parameters

• Tests

• Confidence Intervals

for β1, and β0 + β1x0


2.5.1 Inference for β1

H0: β1 = β10 vs. H1: β1 ≠ β10

Under H0,

  t0 = (β̂1 − β10)/√(MSE/Sxx)

has a t-distribution on n − 2 degrees of freedom.

  p-value = P(|t_{n−2}| > |t0|)

e.g. testing significance of regression for roller:

  H0: β1 = 0 vs. H1: β1 ≠ 0

  t0 = (2.67 − 0)/.699 = 3.82

  p-value = P(|t8| > 3.82) = 2(1 − P(t8 < 3.82)) = .00509

R command: 2*(1-pt(3.82,8))
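The same test can also be read off the fitted model object. A small sketch (roller.lm is the fit from earlier; the printed numbers are approximate):

> summary(roller.lm)$coefficients["weight", ]   # estimate, s.e., t value, p-value
#  Estimate Std. Error    t value   Pr(>|t|)
#     2.667      0.700      3.808    0.00518
> 2*pt(abs(3.808), df = 8, lower.tail = FALSE)  # two-sided p-value by hand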


(1 − α) Confidence Intervals

◦ slope:

  β̂1 ± t_{n−2,α/2} s.e.   or   β̂1 ± t_{n−2,α/2} √(MSE/Sxx)

  roller e.g. (95% confidence interval):

  2.67 ± t_{8,.025}(.699)   or   2.67 ± 2.31(.699) = 2.67 ± 1.61

  R command: qt(.975, 8)

◦ intercept:

  β̂0 ± t_{n−2,α/2} s.e.

  e.g. −2.11 ± 2.31(4.74) = −2.11 ± 10.9
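With the built-in fit, both intervals come from confint. A sketch (roller.lm as fitted earlier; the values shown are approximate):

> confint(roller.lm, level = 0.95)
#                 2.5 %   97.5 %
# (Intercept)  about -13.0 to 8.9
# weight       about  1.05 to 4.28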


2.5.2 Confidence Interval for Mean Response

◦ E[y|x0] = β0 + β1x0, where x0 is a possible value that the predictor could take.

◦ Ê[y|x0] = β̂0 + β̂1x0 is a point estimate of the mean response at that value.

◦ To find a 1 − α confidence interval for E[y|x0], we need the variance of ŷ0 = Ê[y|x0]:

  Var(ŷ0) = Var(β̂0 + β̂1x0)

  β̂0 = ∑_{i=1}^n ciyi,   β̂1x0 = ∑_{i=1}^n bix0yi


Therefore,

  ŷ0 = ∑_{i=1}^n (ci + bix0)yi

so

  Var(ŷ0) = ∑_{i=1}^n (ci + bix0)² Var(yi)

          = σ² ∑_{i=1}^n (ci + bix0)²

          = σ² ∑_{i=1}^n (1/n + (x0 − x̄)(xi − x̄)/Sxx)²

          = σ²(1/n + (x0 − x̄)²/Sxx)

◦ The estimated variance is MSE(1/n + (x0 − x̄)²/Sxx).

◦ The confidence interval is then

  ŷ0 ± t_{n−2,α/2} √(MSE(1/n + (x0 − x̄)²/Sxx))

◦ e.g. Compute a 95% confidence interval for the expected depression when the weight is 5 tonnes.

  x0 = 5, ŷ0 = −2.11 + 2.67(5) = 11.2

  n = 10, x̄ = 6.07, Sxx = 92.6, MSE = 45.2

  Interval:

  11.2 ± t_{8,.025} √(45.2(1/10 + (5 − 6.07)²/92.6))

  = 11.2 ± 2.31√5.08 = 11.2 ± 5.21 = (5.99, 16.41)

◦ R code:

> predict(roller.lm,
          newdata = data.frame(weight = 5),
          interval = "confidence")

fit lwr upr

[1,] 11.2 6.04 16.5

◦ Ex. Write your own R function to compute this interval.
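One possible solution sketch for this exercise (the function name ci.mean.response and its arguments are illustrative, not from the notes):

ci.mean.response <- function(x, y, x0, level = 0.95) {
    n    <- length(x)
    Sxx  <- sum((x - mean(x))^2)
    b1   <- sum((x - mean(x))*(y - mean(y)))/Sxx
    b0   <- mean(y) - b1*mean(x)
    MSE  <- sum((y - b0 - b1*x)^2)/(n - 2)
    y0   <- b0 + b1*x0                         # point estimate of E[y|x0]
    half <- qt(1 - (1 - level)/2, df = n - 2) *
              sqrt(MSE*(1/n + (x0 - mean(x))^2/Sxx))
    c(fit = y0, lwr = y0 - half, upr = y0 + half)
}
# e.g. ci.mean.response(roller$weight, roller$depression, x0 = 5)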

2.6: Predicted Responses

◦ If a new observation were to be taken at x0, we would predict it to be y0 = β0 + β1x0 + ε, where ε is independent noise.

◦ ŷ0 = β̂0 + β̂1x0 is a point prediction of the response at that value.

◦ To find a 1 − α prediction interval for y0, we need the variance of y0 − ŷ0:

  Var(y0 − ŷ0) = Var(β0 + β1x0 + ε − β̂0 − β̂1x0)

              = Var(−β̂0 − β̂1x0) + Var(ε)

              = σ²(1/n + (x0 − x̄)²/Sxx) + σ²


              = σ²(1 + 1/n + (x0 − x̄)²/Sxx)

◦ The estimated variance is MSE(1 + 1/n + (x0 − x̄)²/Sxx).

◦ The prediction interval is then

  ŷ0 ± t_{n−2,α/2} √(MSE(1 + 1/n + (x0 − x̄)²/Sxx))

◦ e.g. Compute a 95% prediction interval for the depression for a single new observation where the weight is 5 tonnes.

  x0 = 5, ŷ0 = −2.11 + 2.67(5) = 11.2

  n = 10, x̄ = 6.07, Sxx = 92.6, MSE = 45.2

  Interval:

  11.2 ± t_{8,.025} √(45.2(1 + 1/10 + (5 − 6.07)²/92.6))

  = 11.2 ± 2.31√50.3 = 11.2 ± 16.4 = (−5.2, 27.6)

R code for Prediction Intervals

> predict(roller.lm,
          newdata = data.frame(weight = 5),
          interval = "prediction")

fit lwr upr

[1,] 11.2 -5.13 27.6

◦ Write your own R function to produce this interval.


Degrees of Freedom

• A random sample Y1, …, Yn of size n from a normal population with mean µ and variance σ² has n degrees of freedom.

• Each linearly independent restriction reduces the number of degrees of freedom by 1.

• ∑_{i=1}^n (Yi − µ)²/σ² has a χ²(n) distribution.

• ∑_{i=1}^n (Yi − Ȳ)²/σ² has a χ²(n−1) distribution. (Calculating Ȳ imposes one linear restriction.)

• ∑_{i=1}^n (Yi − Ŷi)²/σ² has a χ²(n−2) distribution. (Calculating β̂0 and β̂1 imposes two linearly independent restrictions.)

• Given the xi's, there are 2 degrees of freedom in the quantity β̂0 + β̂1xi.

• Given xi and Ȳ, there is one degree of freedom: ŷi − Ȳ = β̂1(xi − x̄).


2.7: Analysis of Variance: Breaking down (analyzing) Variation

◦ The variation in the data (responses) is summarized by

  Syy = ∑_{i=1}^n (yi − ȳ)² = TSS   (Total sum of squares)

◦ 2 sources of variation in the responses:

1. variation due to the straight-line relationship with the predictor

2. deviation from the line (noise)

◦ yi − ȳ (deviation from the data centre) = yi − ŷi (residual) + ŷi − ȳ (difference between line and centre), i.e.

  yi − ȳ = ei + ŷi − ȳ

so

  Syy = ∑(yi − ȳ)²

      = ∑(ei + ŷi − ȳ)²

      = ∑ei² + ∑(ŷi − ȳ)²

since ∑ eiŷi = 0 and ȳ ∑ ei = 0. Hence

  Syy = SSE + ∑(ŷi − ȳ)² = SSE + SSR

The last term is the regression sum of squares.

Relation between SSR and β̂1

◦ We saw earlier that

  SSE = Syy − β̂1Sxy

◦ Therefore,

  SSR = β̂1Sxy = β̂1²Sxx

  Note that, for a given set of x's, SSR depends only on β̂1.

  MSR = SSR/d.f. = SSR/1   (1 degree of freedom for the slope parameter)


Expected Sums of Squares

E[SSR] = E[Sxxβ̂1²]

       = Sxx(Var(β̂1) + (E[β̂1])²)

       = Sxx(σ²/Sxx + β1²)

       = σ² + β1²Sxx

       = E[MSR]

Therefore, if β1 = 0, then SSR is an unbiased estimator for σ².

◦ Development of E[MSE]:

  E[Syy] = E[∑(yi − ȳ)²] = E[∑yi² − nȳ²] = ∑E[yi²] − nE[ȳ²]


Development of E[MSE] (cont’d)

Consider the 2 terms on the RHS separately:

1st term:

  E[yi²] = Var(yi) + (E[yi])² = σ² + (β0 + β1xi)²

  ∑E[yi²] = nσ² + nβ0² + 2nβ0β1x̄ + β1²∑xi²

2nd term:

  E[ȳ²] = Var(ȳ) + (E[ȳ])² = σ²/n + (β0 + β1x̄)²

  nE[ȳ²] = σ² + nβ0² + 2nβ0β1x̄ + nβ1²x̄²


Development of E[MSE] (cont’d)

E[Syy] = (n − 1)σ² + β1²∑(xi − x̄)² = E[SST]

E[SSE] = E[Syy] − E[SSR]

       = (n − 1)σ² + β1²∑(xi − x̄)² − (σ² + β1²Sxx)

       = (n − 2)σ²

E[MSE] = E[SSE/(n − 2)] = σ²


Another approach to testing H0 : β1 = 0

◦ Under the null hypothesis, both MSE and MSR estimate σ².

◦ Under the alternative, only MSE estimates σ²:

  E[MSR] = σ² + β1²Sxx > σ²

◦ A reasonable test is

  F0 = MSR/MSE ∼ F_{1,n−2}

◦ Large F0 ⇒ evidence against H0.

◦ Note t²_ν = F_{1,ν}, so this is really the same test as

  t0² = (β̂1/√(MSE/Sxx))² = β̂1²Sxx/MSE = MSR/MSE


The ANOVA table

Source   df      SS             MS            F
Reg.     1       β̂1²Sxx         β̂1²Sxx        MSR/MSE
Error    n − 2   Syy − β̂1²Sxx   SSE/(n − 2)
Total    n − 1   Syy

roller data example:

> anova(roller.lm) # R code

Analysis of Variance Table

Response: depression
          Df Sum Sq Mean Sq F value Pr(>F)
weight     1    658     658    14.5 0.0052
Residuals  8    363      45

(Recall that the t-statistic for testing β1 = 0 had been 3.81 = √14.5.)

◦ Ex. Write an R function to compute these ANOVA quantities.
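A possible sketch for this exercise (the function name my.anova and its return layout are illustrative only):

my.anova <- function(x, y) {
    n   <- length(x)
    Sxx <- sum((x - mean(x))^2)
    Sxy <- sum((x - mean(x))*(y - mean(y)))
    Syy <- sum((y - mean(y))^2)
    b1  <- Sxy/Sxx
    SSR <- b1^2*Sxx            # regression sum of squares
    SSE <- Syy - SSR           # error sum of squares
    MSE <- SSE/(n - 2)
    F0  <- (SSR/1)/MSE         # F statistic on 1 and n-2 df
    c(SSR = SSR, SSE = SSE, MSE = MSE, F = F0,
      p.value = pf(F0, 1, n - 2, lower.tail = FALSE))
}
# my.anova(roller$weight, roller$depression)   # F close to 14.5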


Confidence Interval for σ²

◦ SSE/σ² ∼ χ²_{n−2}, so

  P(χ²_{n−2,1−α/2} ≤ SSE/σ² ≤ χ²_{n−2,α/2}) = 1 − α

  and hence

  P(SSE/χ²_{n−2,α/2} ≤ σ² ≤ SSE/χ²_{n−2,1−α/2}) = 1 − α

e.g. roller data:

  SSE = 363

  χ²_{8,.975} = 2.18   (R code: qchisq(.025, 8))

  χ²_{8,.025} = 17.5   (R code: qchisq(.975, 8))

  (363/17.5, 363/2.18) = (20.7, 166.5)
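The same interval can be computed directly from the fitted model. A short sketch (roller.lm as before; values approximate):

SSE <- sum(resid(roller.lm)^2)        # about 363
df  <- roller.lm$df.residual          # 8
c(lower = SSE/qchisq(0.975, df),      # divide by the upper chi-square quantile
  upper = SSE/qchisq(0.025, df))      # divide by the lower chi-square quantile
# roughly (20.7, 166.5)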


2.7.1 R² - Coefficient of Determination

◦ R² is the fraction of the response variability explained by the regression:

  R² = SSR/Syy

◦ 0 ≤ R² ≤ 1. Values near 1 imply that most of the variability is explained by the regression.

◦ roller data: SSR = 658 and Syy = 1021, so

  R² = 658/1021 = .644


R output

> summary(roller.lm)

...

Multiple R-Squared: 0.644, ...

◦ Ex. Write an R function which computes R2.
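One short sketch for this exercise (the name r.squared is illustrative):

r.squared <- function(x, y) {
    Sxx <- sum((x - mean(x))^2)
    Sxy <- sum((x - mean(x))*(y - mean(y)))
    Syy <- sum((y - mean(y))^2)
    (Sxy^2/Sxx)/Syy            # SSR/Syy, since SSR = Sxy^2/Sxx
}
# r.squared(roller$weight, roller$depression)   # about 0.644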

◦ Another interpretation:

  E[R²] ≈ E[SSR]/E[Syy] = (β1²Sxx + σ²)/((n − 1)σ² + β1²Sxx)

        = (β1²Sxx/(n − 1) + σ²/(n − 1))/(σ² + β1²Sxx/(n − 1))

        ≈ (β1²Sxx/(n − 1))/(σ² + β1²Sxx/(n − 1))

  for large n. (Note: this differs from the textbook.)


Properties of R²

• Thus, R² increases as

1. Sxx increases (x's become more spread out)

2. σ² decreases

◦ Cautions

1. R² does not measure the magnitude of the regression slope.

2. R² does not measure the appropriateness of the linear model.

3. A large value of R² does not imply that the regression model will be an accurate predictor.


Hazards of Regression

• Extrapolation: predicting y values outside the range of observed x values. There is no guarantee that a future response would behave in the same linear manner outside the observed range. e.g. Consider an experiment with a spring. The spring is stretched to several different lengths x (in cm) and the restoring force F (in Newtons) is measured:

x F

3 5.1

4 6.2

5 7.9

6 9.5

> spring.lm <- lm(F ~ x - 1, data = spring)

> summary(spring.lm)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

x 1.5884 0.0232 68.6 6.8e-06

The fitted model relating F to x is

F = 1.58x

Can we predict the restoring force for the spring, if it has been extended to a length of 15 cm?

• High leverage observations: x values at the extremes of the range have more influence on the slope of the regression than observations near the middle of the range.

• Outliers can distort the regression line. Outliers may be incorrectly recorded OR may be an indication that the linear relation or constant variance assumption is incorrect.

• A regression relationship does not mean that there is a cause-and-effect relationship. e.g. The following data give the number of lawyers and number of homicides in a given year for a number of towns:

no. lawyers no. homicides

1 0

2 0

7 2

10 5

12 6

14 6

15 7

18 8

Note that the number of homicides increases with the number of lawyers. Does this mean that in order to reduce the number of homicides, one should reduce the number of lawyers?

• Beware of nonsense relationships. e.g. It is possible to show that the area of some lakes in Manitoba is related to elevation. Do you think there is a real reason for this? Or is the apparent relation just a result of chance?

2.9 - Regression through the Origin

◦ intercept = 0:

  yi = β1xi + εi

◦ Maximum likelihood and least squares ⇒ minimize ∑_{i=1}^n (yi − β1xi)²:

  β̂1 = ∑xiyi/∑xi²

  ei = yi − β̂1xi,   SSE = ∑ei²

◦ Maximum likelihood ⇒

  σ̂² = SSE/n


◦ Unbiased estimator:

  σ̂² = MSE = SSE/(n − 1)

◦ Properties of β̂1:

  E[β̂1] = β1,   Var(β̂1) = σ²/∑xi²

◦ 1 − α C.I. for β1:

  β̂1 ± t_{n−1,α/2} √(MSE/∑xi²)

◦ 1 − α C.I. for E[y|x0]:

  ŷ0 ± t_{n−1,α/2} √(MSE x0²/∑xi²)

  since Var(β̂1x0) = σ²x0²/∑xi²

◦ 1 − α P.I. for y, given x0:

  ŷ0 ± t_{n−1,α/2} √(MSE(1 + x0²/∑xi²))

◦ R code:

> roller.lm <- lm(depression ~ weight - 1,
                  data = roller)

> summary(roller.lm)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

weight 2.392 0.299 7.99 2.2e-05

Residual standard error: 6.43

on 9 degrees of freedom

Multiple R-Squared: 0.876,

Adjusted R-squared: 0.863

F-statistic: 63.9 on 1 and 9 DF,

p-value: 2.23e-005

> predict(roller.lm,
          newdata = data.frame(weight = 5),
          interval = "prediction")

fit lwr upr

[1,] 12.0 -2.97 26.9

Ch. 2.9.2 Correlation

• Bivariate Normal Distribution:

  f(x, y) = (1/(2πσ1σ2√(1 − ρ²))) exp(−(1/(2(1 − ρ²)))(Y² − 2ρXY + X²))

  where

  X = (x − µ1)/σ1   and   Y = (y − µ2)/σ2

◦ correlation coefficient:

  ρ = σ12/(σ1σ2)

◦ µ1 and σ1² are the mean and variance of x

◦ µ2 and σ2² are the mean and variance of y

◦ σ12 = E[(x − µ1)(y − µ2)], the covariance


Conditional distribution of y given x

  f(y|x) = (1/(√(2π) σ1.2)) exp(−(1/2)((y − β0 − β1x)/σ1.2)²)

  where

  β0 = µ2 − µ1ρσ2/σ1

  β1 = (σ2/σ1)ρ

  σ²1.2 = σ2²(1 − ρ²)


An Explanation:

◦ Suppose

  y = β0 + β1x + ε

  where ε and x are independent N(0, σ²) and N(µ1, σ1²) random variables.

◦ Define µ2 = E[y] and σ2² = Var(y):

  µ2 = β0 + β1µ1

  σ2² = β1²σ1² + σ²

◦ Define σ12 = Cov(x − µ1, y − µ2):

  σ12 = E[(x − µ1)(y − µ2)]

      = E[(x − µ1)(β1x + ε − β1µ1)]

      = β1E[(x − µ1)²] + E[(x − µ1)ε]

      = β1σ1²


◦ Define ρ = σ12/(σ1σ2):

  ρ = β1σ1²/(σ1σ2) = β1(σ1/σ2)

  Therefore,

  β1 = (σ2/σ1)ρ

  and

  β0 = µ2 − β1µ1 = µ2 − µ1(σ2/σ1)ρ

◦ What is the conditional distribution of y, given x = x?

  y|x=x = β0 + β1x + ε

  must be normal with mean

  E[y|x = x] = β0 + β1x = µ2 + ρ(σ2/σ1)(x − µ1)

  and variance

  Var(y|x = x) = σ²1.2 = σ² = σ2² − β1²σ1²

              = σ2² − ρ²(σ2²/σ1²)σ1²

              = σ2²(1 − ρ²)

Estimation

◦ Maximum Likelihood Estimation (using bivariate normal) gives

  β̂0 = ȳ − β̂1x̄

  β̂1 = Sxy/Sxx

  r = ρ̂ = Sxy/√(SxxSyy)

◦ Note that

  r = β̂1√(Sxx/Syy),

  so

  r² = β̂1²(Sxx/Syy) = R²


i.e. the coefficient of determination is the square of the correlation coefficient.

Testing H0: ρ = 0 vs. H1: ρ ≠ 0

◦ Conditional approach (equivalent to testing β1 = 0):

  t0 = β̂1/√(MSE/Sxx)

     = β̂1/√((Syy − β̂1²Sxx)/((n − 2)Sxx))

     = √(β̂1²(n − 2)/(Syy/Sxx − β̂1²))

     = √((n − 2)r²/(1 − r²))

  where we have used β̂1² = r²Syy/Sxx.


◦ The above statistic is a t statistic on n − 2 degrees of freedom ⇒ conclude ρ ≠ 0 if the p-value P(|t_{n−2}| > |t0|) is small.
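In R, cor.test carries out this test directly. A sketch using the roller data from earlier (the printed values are approximate):

> cor.test(roller$weight, roller$depression)
# t = 3.81 on 8 df, p-value about 0.0052, sample correlation r about 0.80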

Testing H0 : ρ = ρ0

  Z = (1/2) log((1 + r)/(1 − r))

has an approximate normal distribution with mean

  µZ = (1/2) log((1 + ρ)/(1 − ρ))

and variance

  σ²Z = 1/(n − 3)

for large n.

◦ Thus,

  Z0 = [log((1 + r)/(1 − r)) − log((1 + ρ0)/(1 − ρ0))] / (2√(1/(n − 3)))

has an approximate standard normal distribution when the null hypothesis is true.
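A small sketch of this test written as a function (the name fisher.z.test and the example numbers are illustrative only):

fisher.z.test <- function(r, n, rho0) {
    z0 <- (0.5*log((1 + r)/(1 - r)) - 0.5*log((1 + rho0)/(1 - rho0))) /
            sqrt(1/(n - 3))
    c(Z0 = z0, p.value = 2*pnorm(-abs(z0)))   # two-sided p-value
}
# e.g. fisher.z.test(r = 0.8, n = 40, rho0 = 0.5)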


Confidence Interval for ρ

• Confidence interval for (1/2) log((1 + ρ)/(1 − ρ)):

  Z ± z_{α/2} √(1/(n − 3))

• Find the endpoints (l, u) of this confidence interval and solve for ρ:

  ((e^{2l} − 1)/(1 + e^{2l}), (e^{2u} − 1)/(1 + e^{2u}))


R code for fossum example

• Find a 95% confidence interval for the correlation between total length and head length:

> source("fossum.R")

> attach(fossum)

> n <- length(totlngth) # sample size

> r <- cor(totlngth,hdlngth)# correlation est.

> zci <- .5*log((1+r)/(1-r)) +

qnorm(c(.025,.975))*sqrt(1/(n-3))

> ci <- (exp(2*zci)-1)/(1+exp(2*zci))

> ci                   # transformed conf. interval
[1] 0.62 0.83          # conf. interval for the true correlation

> detach(fossum)


Recommended