
Stat 112: Notes 2

• This class: Start Section 3.3.

• Thursday’s class: Finish Section 3.3.

• I will e-mail the first homework and post it on the web site tonight. It will be due next Thursday.

Father and Son's Heights

• Francis Galton was interested in the relationship between
  – Y = son's height
  – X = father's height

• Galton surveyed 952 father-son pairs in 19th-century England.

• The data are in Galton.JMP.

[Scatterplot: Bivariate Fit of sons ht By father ht; sons ht (61-75) vs. father ht (63-74)]

Simple Linear Regression Model

[Scatterplot: Bivariate Fit of sons ht By father ht with least squares line; sons ht (61-75) vs. father ht (63-74)]

Linear Fit
sons ht = 26.455559 + 0.6115222 father ht

Summary of Fit
RSquare                 0.177196
RSquare Adj             0.17633
Root Mean Square Error  2.356912

The mean of son's height given father's height is estimated to increase by $b_1 = 0.61$ inches for each one-inch increase in father's height. The average absolute error from using the regression line to predict a son's height based on the father's height is about equal to the RMSE = 2.36.
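For readers working outside JMP, here is a minimal sketch of the same fit in Python, assuming the Galton data have been exported from JMP to a CSV file (the file and column names here are hypothetical):

```python
# Sketch: reproduce the JMP "Linear Fit" output in Python.
# Assumes the Galton data were exported from JMP to "galton.csv" with
# columns "father_ht" and "son_ht" (hypothetical file and column names).
import numpy as np
import pandas as pd

data = pd.read_csv("galton.csv")
x = data["father_ht"].to_numpy()
y = data["son_ht"].to_numpy()

b1, b0 = np.polyfit(x, y, deg=1)                  # least squares slope, intercept

resid = y - (b0 + b1 * x)
rmse = np.sqrt(np.sum(resid**2) / (len(y) - 2))   # RMSE with n - 2 degrees of freedom

print(f"son_ht = {b0:.4f} + {b1:.4f} * father_ht, RMSE = {rmse:.4f}")
```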

Sample vs. Population

• We can view the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ as a sample from a population.

• Our goal is to learn about the relationship between X and Y in the population:
  – We don't care about how fathers' heights and sons' heights are related in the particular 952 father-son pairs sampled, but among all fathers and sons.
  – From Notes 1, we don't care about the relationship between tracks counted and the density of deer for the particular sample, but the relationship in the population of all tracks; this enables us to predict the density of deer in the future from the number of tracks counted.

Simple Linear Regression Model

The simple linear regression model:
$Y_i = \beta_0 + \beta_1 x_i + e_i$.

The $e_i$ are called disturbances, and it is assumed that:

1. Linearity assumption: The conditional expected value of the disturbances given $x_i$ is zero, $E(e_i) = 0$, for each i. This implies that $E(Y \mid x) = \beta_0 + \beta_1 x$, so that the expected value of Y given X is a linear function of X.

2. Constant variance assumption: The disturbances $e_i$ are assumed to all have the same variance $\sigma_e^2$.

3. Normality assumption: The disturbances $e_i$ are assumed to have a normal distribution.

4. Independence assumption: The disturbances $e_i$ are assumed to be independent. This assumption is most important when the data are gathered over time. When the data are cross-sectional (that is, gathered at the same point in time for different individual units), it is typically not an assumption of concern.
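To make the assumptions concrete, here is a short Python sketch that generates data satisfying all four of them; the parameter values are arbitrary choices, loosely echoing the Galton fit:

```python
# Sketch: simulate data satisfying the simple linear regression model
# Y_i = beta0 + beta1 * x_i + e_i, with disturbances that are normal,
# independent, and have constant variance. Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma_e = 26.5, 0.61, 2.36   # loosely echoing the Galton fit
n = 952

x = rng.uniform(63, 74, size=n)            # fathers' heights
e = rng.normal(0, sigma_e, size=n)         # disturbances: E(e_i) = 0
y = beta0 + beta1 * x + e                  # sons' heights
```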

Checking the Assumptions

Simple Linear Regression Model for Population:
$Y_i = \beta_0 + \beta_1 x_i + e_i$.

Before making any inferences using the simple linear regression model, we need to check the assumptions. Based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$:

1. We estimate $\beta_0$ and $\beta_1$ by the least squares estimates $b_0$ and $b_1$.

2. We estimate the disturbances $e_i$ by the residuals $\hat{e}_i = Y_i - \hat{E}(Y_i \mid X_i) = Y_i - (b_0 + b_1 X_i)$.

3. We check if the residuals approximately satisfy:
   (1) Linearity: $E(\hat{e}_i) \approx 0$ for each range of $X_i$.
   (2) Constant variance: $Var(\hat{e}_i) \approx$ constant across the range of $X_i$.
   (3) Normality: the $\hat{e}_i$ are approximately normally distributed.
   (4) Independence: the $\hat{e}_i$ are independent (only a worry for time series data).
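Steps 1 and 2 can be carried out directly from the standard least squares formulas; a minimal Python sketch for generic numeric arrays x and y:

```python
# Sketch: least squares estimates (step 1) and residuals (step 2),
# computed from the standard least squares formulas.
import numpy as np

def least_squares_residuals(x, y):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    e_hat = y - (b0 + b1 * x)              # residuals estimate the e_i
    return b0, b1, e_hat
```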

Residual Plot

A useful tool for checking the assumptions is the residual plot. The residual for observation i is
$\hat{e}_i = y_i - \hat{E}(y_i \mid x_i) = y_i - (b_0 + b_1 x_i)$.

The residual plot is a plot of the residuals $\hat{e}_i$ versus $x_i$. It is constructed in JMP by fitting the least squares line, then clicking the red triangle next to Linear Fit and clicking Plot Residuals.

[Residual plot: Residual vs. father ht (63-74), with a horizontal line at 0]
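Outside JMP, the same plot takes a few lines of matplotlib; a sketch on synthetic Galton-like data (arbitrary parameter values):

```python
# Sketch: residual plot (residuals versus x) with matplotlib,
# on synthetic Galton-like data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(63, 74, size=952)
y = 26.5 + 0.61 * x + rng.normal(0, 2.36, size=952)

b1, b0 = np.polyfit(x, y, deg=1)
e_hat = y - (b0 + b1 * x)

plt.scatter(x, e_hat, s=10)
plt.axhline(0, color="red")          # reference line at zero
plt.xlabel("father ht")
plt.ylabel("residual")
plt.show()
```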

Checking Linearity Assumption

To check if the linearity assumption holds (i.e., the model for the mean is correct), check whether $E(\hat{e}_i)$ is zero for each range of $X_i$.

[Residual plot: Residual vs. father ht (63-74), with a horizontal line at 0]

The linearity assumption appears reasonable, except for the largest fathers' heights (the residuals there are all positive).
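One way to make this visual check concrete is to compute the mean residual within several ranges of X; a Python sketch on synthetic data:

```python
# Sketch: mean residual within ranges (bins) of x; each bin mean
# should be near zero if the linearity assumption holds.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(63, 74, size=952)
y = 26.5 + 0.61 * x + rng.normal(0, 2.36, size=952)
b1, b0 = np.polyfit(x, y, deg=1)
e_hat = y - (b0 + b1 * x)

edges = np.linspace(x.min(), x.max(), 6)       # five ranges of x
which = np.digitize(x, edges[1:-1])
for b in range(5):
    sel = which == b
    print(f"range {b}: mean residual = {e_hat[sel].mean():+.3f}")
```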

Violation of Linearity

For a sample of McDonald's restaurants:
Y = Revenue of Restaurant
X = Mean Age of Children in Neighborhood of Restaurant

[Scatterplot: Bivariate Fit of Revenue By Age; Revenue (800-1300) vs. Age (2.5-15.0)]

[Residual plot: Residual (-200 to 300) vs. Age (2.5-15.0)]

The mean of the residuals is negative for small and large ages and positive for middle ages; linearity appears to be violated (we will see what to do when linearity is violated in Chapter 5).

Checking Constant Variance

To check that the constant variance assumption holds, check that there is no pattern in the spread of the residuals as X varies.

[Residual plot: Residual vs. father ht (63-74), with a horizontal line at 0]

Constant variance appears reasonable for the father-son height data.
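The spread check can likewise be made numeric by comparing the residual standard deviation across ranges of X; a sketch on synthetic data:

```python
# Sketch: spread (SD) of residuals within ranges of x; roughly equal
# spreads are consistent with the constant variance assumption.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(63, 74, size=952)
y = 26.5 + 0.61 * x + rng.normal(0, 2.36, size=952)
b1, b0 = np.polyfit(x, y, deg=1)
e_hat = y - (b0 + b1 * x)

edges = np.linspace(x.min(), x.max(), 6)
which = np.digitize(x, edges[1:-1])
for b in range(5):
    sel = which == b
    print(f"range {b}: residual SD = {e_hat[sel].std(ddof=1):.3f}")
```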

Checking Normality

For checking normality, we can look at whether the overall distribution of the residuals looks approximately normal by making a histogram of the residuals. Save the residuals by clicking the red triangle next to Linear Fit (after Fit Line) and then clicking Save Residuals. Then click Analyze, Distribution and put the saved residuals column into Y, Columns. The histogram should be approximately bell shaped if the normality assumption holds.

[Histogram: Distributions of Residuals sons ht, ranging from about -9 to 6]

The residuals from the father-son height data have an approximately bell-shaped histogram, although there is some indication of skewness to the left. The normality assumption seems reasonable. We will look at more formal tools for assessing normality in Chapter 6.
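A Python sketch of the same histogram check, again on synthetic Galton-like data:

```python
# Sketch: histogram of residuals; it should look roughly bell shaped
# if the normality assumption holds.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(63, 74, size=952)
y = 26.5 + 0.61 * x + rng.normal(0, 2.36, size=952)
b1, b0 = np.polyfit(x, y, deg=1)
e_hat = y - (b0 + b1 * x)

plt.hist(e_hat, bins=20, edgecolor="black")
plt.xlabel("residual")
plt.ylabel("count")
plt.show()
```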

Inferences

Simple Linear Regression Model for Population:
$Y_i = \beta_0 + \beta_1 x_i + e_i$.

Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$.

The least squares estimates $b_0$ and $b_1$ will typically not be exactly equal to the true $\beta_0$ and $\beta_1$.

Inferences: Draw conclusions about $\beta_0$ and $\beta_1$ based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$.

• Point estimates: Best estimates of $\beta_0$ and $\beta_1$.
• Confidence intervals: Ranges of plausible values of $\beta_0$ and $\beta_1$.
• Hypothesis tests: Test whether it is plausible that $\beta_0$ and $\beta_1$ equal certain values.
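All three kinds of inference are available from standard regression software; a Python sketch using statsmodels (one of several options), on synthetic Galton-like data:

```python
# Sketch: point estimates, 95% confidence intervals, and tests of
# beta_j = 0 via statsmodels, on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(63, 74, size=952)
y = 26.5 + 0.61 * x + rng.normal(0, 2.36, size=952)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)                  # point estimates b0, b1
print(fit.conf_int(alpha=0.05))    # 95% confidence intervals
print(fit.pvalues)                 # two-sided tests of beta_j = 0
```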

Sampling Distribution of $b_0$, $b_1$

• Utopia.JMP contains simulations of pairs $(x_1, y_1), \ldots, (x_n, y_n)$ and $(x_1, y_1^*), \ldots, (x_n, y_n^*)$ from a simple linear regression model with $\beta_0 = 1$, $\beta_1 = 1$, $\sigma_e^2 = 1$.

• Notice the difference in the estimated coefficients from the y's and the y*'s.

• The sampling distribution of $b_0, b_1$ describes the probability distribution of the estimates over repeated samples $(y_1, \ldots, y_n)$ from the simple linear regression model with $(x_1, \ldots, x_n)$ held fixed.
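The idea of repeated samples with the x's held fixed is easy to simulate; a Python sketch using the Utopia parameter values:

```python
# Sketch: sampling distribution of (b0, b1). Hold the x's fixed,
# repeatedly redraw y from the model (beta0 = beta1 = 1, sigma_e = 1,
# as in the Utopia example), refit, and collect the estimates.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 35, size=30)              # fixed design points

b0s, b1s = [], []
for _ in range(10_000):
    y = 1 + 1 * x + rng.normal(0, 1, size=x.size)
    b1, b0 = np.polyfit(x, y, deg=1)
    b0s.append(b0)
    b1s.append(b1)

print("mean b0 =", np.mean(b0s), " mean b1 =", np.mean(b1s))  # both near 1
```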

Utopia

[Scatterplot: y (0-35) vs. x (0-35) with least squares line]

Linear Fit
y = 1.4977506 + 0.9876713 x

Parameter Estimates
Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  1.4977506   0.300146   4.99     <.0001
x          0.9876713   0.016907   58.42    <.0001

Linear Fit
y* = 0.9469452 + 1.0216591 x

Parameter Estimates
Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  0.9469452   0.364246   2.60     0.0147
x          1.0216591   0.020517   49.79    <.0001

[Scatterplot: y* (0-35) vs. x (0-35) with least squares line]

Sampling distributions

• Sampling distribution of $b_0$:
  – $E(b_0) = \beta_0$
  – $Var(b_0) = \sigma_e^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{(n-1)s_x^2} \right)$
  – The sampling distribution is normally distributed.

• Sampling distribution of $b_1$:
  – $E(b_1) = \beta_1$
  – $Var(b_1) = \frac{\sigma_e^2}{(n-1)s_x^2}$
  – The sampling distribution is normally distributed.
  – Even if the normality assumption fails and the errors e are not normal, the sampling distributions of $b_0, b_1$ are still approximately normal if n > 30.
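A quick simulation check of the $Var(b_1)$ formula (a sketch; the design points are arbitrary):

```python
# Sketch: numerically check Var(b1) = sigma_e^2 / ((n-1) * s_x^2)
# by refitting many redraws of y with x held fixed
# (beta0 = beta1 = sigma_e = 1).
import numpy as np

rng = np.random.default_rng(1)
n, sigma_e = 30, 1.0
x = rng.uniform(0, 35, size=n)                 # fixed design points

b1s = [np.polyfit(x, 1 + x + rng.normal(0, sigma_e, n), 1)[0]
       for _ in range(10_000)]

print("formula  :", sigma_e**2 / ((n - 1) * x.var(ddof=1)))
print("simulated:", np.var(b1s, ddof=1))
```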

Properties of $b_0$ and $b_1$ as estimators of $\beta_0$ and $\beta_1$

• Unbiased estimators: The mean of the sampling distribution is equal to the population parameter being estimated.

• Consistent estimators: As the sample size n increases, the probability that the estimator will be as close as you specify to the true parameter converges to 1.

• Minimum variance estimator: The variance of $b_1$ is smaller than the variance of any other linear unbiased estimator of $\beta_1$, say $b_1^*$.
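The consistency property can be seen in a small simulation: the spread of $b_1$ around $\beta_1 = 1$ shrinks as n grows (a sketch with arbitrary parameter values):

```python
# Sketch: consistency of b1. Its sampling spread around beta1 = 1
# shrinks toward zero as the sample size n grows.
import numpy as np

rng = np.random.default_rng(2)
for n in (10, 100, 1000):
    b1s = []
    for _ in range(2000):
        x = rng.uniform(0, 35, size=n)
        y = 1 + x + rng.normal(0, 1, size=n)
        b1s.append(np.polyfit(x, y, deg=1)[0])
    print(f"n = {n:4d}: SD of b1 = {np.std(b1s):.4f}")
```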