Chapter 14: Inference for Regression
Lesson 14-1, Part 1: Inference for Regression
Review Least-Squares Regression
A family doctor is interested in examining the relationship between a patient’s age and total cholesterol. He randomly selects 14 of his female patients and obtains the data presented in Table 1. The data are based upon results obtained from the National Center for Health Statistics.
Table 1
Age Total Cholesterol Age Total Cholesterol
25 180 42 183
25 195 48 204
28 186 51 221
32 180 51 243
32 210 58 208
32 197 62 228
38 239 65 269
Review Least-Squares Regression
1. What is the least-squares regression line for predicting total cholesterol from age for women?
The least-squares regression equation is ŷ = 151.3537 + 1.3991x, where ŷ represents the predicted total cholesterol for a female whose age is x.
Review Least-Squares Regression
2. What is the correlation coefficient between age and cholesterol? Interpret the correlation coefficient in the context of the problem.
The linear correlation coefficient is r = 0.718. There is a moderate, positive linear relationship between a woman’s age and her total cholesterol.
Review Least-Squares Regression
3. What is the predicted cholesterol level of a 67-year-old female?
$$\begin{aligned}
\hat{y} &= 151.3537 + 1.3991x \\
\widehat{\text{cholesterol}} &= 151.3537 + 1.3991(\text{age}) \\
&= 151.3537 + 1.3991(67) \\
&\approx 245
\end{aligned}$$
Review Least-Squares Regression
4. Interpret the slope of the regression line in the context of the problem.
For each increase in age of one year, the total cholesterol is predicted to increase by 1.3991.
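The four review answers can be reproduced directly from Table 1. The following is a minimal sketch in plain Python (the variable names are illustrative, not from the slides) that computes the slope, intercept, and correlation from the sums of squares and then predicts the cholesterol level at age 67.

```python
import math

# Table 1: age (x) and total cholesterol (y) for the 14 patients
age = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n

# Sums of squares and cross-products about the means
sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in chol)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))

b = sxy / sxx                    # slope of the least-squares line
a = y_bar - b * x_bar            # intercept of the least-squares line
r = sxy / math.sqrt(sxx * syy)   # correlation coefficient

predicted_67 = a + b * 67        # predicted cholesterol at age 67

print(f"y-hat = {a:.4f} + {b:.4f}x, r = {r:.3f}")
print(f"predicted cholesterol at age 67: {predicted_67:.0f}")
```

Working from deviations about the means keeps the arithmetic close to the formulas a calculator's LinReg routine applies.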
Statistics and Parameters
• When doing inference for regression, we use ŷ = a + bx to estimate the population regression line.
▫ a and b are estimators of the population parameters α and β, the intercept and slope of the population regression line.
Conditions
• The conditions necessary for doing inference for regression are:
▫ For each given value of x, the values of the response variable y are independent and normally distributed.
▫ For each value of x, the standard deviation, σ, of the y-values is the same.
▫ The mean responses μy for the fixed values of x are linearly related by the equation μy = α + βx.
Standard Error of the Regression Line
• Gives the variability of the vertical distances of the y-values from the regression line
• Remember that a residual was the error involved when making a prediction from the regression equation
• The spread around the line is measured with the standard deviation of the residuals, s.
$$s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum \text{residuals}^2}{n-2}}$$
Standard Error of the Slope of the Regression Line
• Gives the variability of the estimates of the slope of the regression line
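$$SE_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{\sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}}{\sqrt{\sum (x_i - \bar{x})^2}}$$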
Summary
• Inference for regression depends upon estimating μy = α + βx with ŷ = a + bx.
• For each x, the response values of y are independent and follow a normal distribution, each distribution having the same standard deviation.
• Inference for regression depends on the following statistics:
▫ a, the estimate of the y-intercept, α, of μy
▫ b, the estimate of the slope, β, of μy
▫ s, the standard error of the residuals
▫ SEb, the standard error of the slope of the regression line
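The four statistics in the summary can be computed together. Below is a minimal sketch, assuming plain Python lists as input; the helper name `regression_statistics` is hypothetical and is applied here to the Table 1 cholesterol data.

```python
import math

def regression_statistics(x, y):
    """Return (a, b, s, se_b) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # estimate of the slope beta
    a = y_bar - b * x_bar              # estimate of the intercept alpha
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))       # standard error of the residuals
    se_b = s / math.sqrt(sxx)          # standard error of the slope
    return a, b, s, se_b

# Table 1 data
age = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]

a, b, s, se_b = regression_statistics(age, chol)
print(f"a = {a:.4f}, b = {b:.4f}, s = {s:.2f}, SE_b = {se_b:.4f}")
```

Dividing by n – 2 rather than n reflects the two parameters (α and β) estimated from the data.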
Computing Standard Error of the Residual
Age, x   Total Cholesterol, y   ŷ = 151.3537 + 1.3991x   Residual (y – ŷ)   Residual² (y – ŷ)²
25 180 186.33 -6.33 40.0689
25 195 186.33 8.67 75.1689
28 186 190.53 -4.53 20.5209
32 180 196.12 -16.12 259.8544
32 210 196.12 13.88 192.6544
32 197 196.12 0.88 0.7744
38 239 204.52 34.48 1188.8704
62 228 238.10 -10.10 102.01
65 269 242.30 26.70 712.89
Σ residuals² = 4553.708 (summed over all 14 patients)
Computing Standard Error
$$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{4553.708}{14-2}} \approx 19.48$$
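The residual table and the value of s can be rebuilt for all 14 patients in Table 1. This sketch uses the rounded coefficients from the review slides, so the sum of squared residuals may differ from the slide's total in the first decimal place because of rounding.

```python
import math

# Fitted line from the review slides (rounded coefficients)
a, b = 151.3537, 1.3991

# Table 1 data
age = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]

rows = []
for x, y in zip(age, chol):
    y_hat = a + b * x          # predicted cholesterol
    resid = y - y_hat          # residual = observed - predicted
    rows.append((x, y, y_hat, resid, resid ** 2))
    print(f"{x:3d} {y:4d} {y_hat:8.2f} {resid:8.2f} {resid ** 2:10.4f}")

sse = sum(row[4] for row in rows)          # sum of squared residuals
s = math.sqrt(sse / (len(age) - 2))        # standard error of the residuals
print(f"sum of squared residuals = {sse:.3f}, s = {s:.2f}")
```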
Example – Page 787, #14.2
Body weights and backpack weights were collected for eight students.
Weight (lbs): 120 187 109 103 131 165 158 116
Backpack weight (lbs): 26 30 26 24 29 35 31 28
These data were entered into a statistical package and a least-squares regression of backpack weight on body weight was requested. Here are the results.
Example – Page 787, #14.2
Predictor Coef Stdev t-ratio p
Constant 16.265 3.937 4.13 0.006
BodyWT 0.09080 0.02831 3.21 0.018
S = 2.270 R-sq = 63.2% R-sq(adj) = 57.0%
A) What is the equation of the least-squares line?
Predicted backpack weight = 16.265 + 0.09080(body weight)
Example – Page 787, #14.2
B) The model for regression inference has three parameters,
which we call α, β and σ. Can you determine the
estimates for α and β from the computer printout?
a = 16.265 estimates the true intercept α and b = 0.09080
estimates the true slope β.
Example – Page 787, #14.2
C) The computer output reports that s = 2.270. This is an
estimate of the parameter σ. Use the formula for s to
verify the computer’s value of s.
Use your TI to verify this.
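As an alternative to the calculator, the check in part C can be sketched in a few lines of Python. The coefficients are taken from the printout, and the list names are illustrative.

```python
import math

# Body weight (x) and backpack weight (y) for the eight students, in lbs
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

# Intercept and slope as reported in the computer printout
a, b = 16.265, 0.09080

# Residual = observed backpack weight minus predicted backpack weight
residuals = [y - (a + b * x) for x, y in zip(body, pack)]

# s = sqrt( sum of squared residuals / (n - 2) )
n = len(body)
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
print(f"s = {s:.3f}")
```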
Example – Page 788, #14.4
Exercise 3.71 on page 187 provided data on the speed of competitive runners and the number of steps they took per second. Good runners take more steps per second as they speed up. Here are the data again.
Speed (feet per second): 15.86 16.88 17.50 18.62 19.97 21.06 22.11
Steps per second: 3.05 3.12 3.17 3.25 3.36 3.46 3.55
A) Enter the data into your calculator, perform a least-squares regression, and plot the scatterplot with the least-squares line. What is the strength of the association between speed and steps per second?
Example – Page 788, #14.4
Steps = 1.77 + 0.0803(speed). There is a very strong positive linear relationship between speed and steps per second; r = 0.999. Nearly all the variation in steps per second (r² = 0.998, or 99.8%) is explained by the linear relationship.
Example – Page 788, #14.4
[Scatterplot with least-squares line: steps per second versus speed (feet per second)]
Example – Page 788, #14.4
C) The model for regression inference has three parameters, α, β and σ. Estimate these parameters from the data.
a = 1.766 is the estimate of α
b = 0.0803 is the estimate of β
s = 0.0091 is the estimate of σ
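These three estimates can be recovered from the seven data points. A minimal sketch, with illustrative variable names, follows.

```python
import math

# Running data: speed in feet per second (x) and steps per second (y)
speed = [15.86, 16.88, 17.50, 18.62, 19.97, 21.06, 22.11]
steps = [3.05, 3.12, 3.17, 3.25, 3.36, 3.46, 3.55]

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(steps) / n
sxx = sum((x - x_bar) ** 2 for x in speed)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, steps))

b = sxy / sxx              # estimate of beta
a = y_bar - b * x_bar      # estimate of alpha

# Estimate of sigma: standard deviation of the residuals with n - 2 df
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(speed, steps))
s = math.sqrt(sse / (n - 2))

print(f"a = {a:.3f}, b = {b:.4f}, s = {s:.4f}")
```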
Lesson 14-1, Part 2: Inference for Regression
Significance Test for the Slope of a Regression Line
• We want to test whether the slope of the regression line is zero or not.
▫ If the slope of the line is zero, then there is no linear relationship between the x and y variables.
▫ Remember that b = r(sy/sx), so if r = 0, then b = 0.
• Hypotheses
▫ Two-tailed: H0: β = 0 and Ha: β ≠ 0
▫ Left-tailed: H0: β = 0 and Ha: β < 0
▫ Right-tailed: H0: β = 0 and Ha: β > 0
Test Statistics and Confidence Interval
• t distribution with n – 2 degrees of freedom
• SEb = Standard error of the slope
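Test statistic (under H0: β = 0):
$$t = \frac{b - \beta}{SE_b} = \frac{b}{SE_b}$$
Confidence interval for the slope:
$$b \pm t^* SE_b$$
where
$$SE_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$$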
Reading Computer Printouts
Example – Page 794, #14.6
Exercise 14.1 (page 786) presents data on the lengths of two bones in five fossil specimens of the extinct beast Archaeopteryx. Here is part of the output from the S-PLUS statistical software when we regress the length y of the humerus on the length x of the femur.
Coefficients
Value Std Error t value Pr(>|t|)
(Intercepts) – 3.6596 4.4590 – 0.8207 0.4719
Femur 1.1969 0.0751
Example – Page 794, #14.6
A) What is the equation of the least-squares regression line?
$$\widehat{\text{humerus}} = -3.6596 + 1.1969(\text{femur})$$
Example – Page 794, #14.6
B) We left out the t statistic for testing H0: β = 0 and its P-value. Use the output to find t.
$$t = \frac{b}{SE_b} = \frac{1.1969}{0.0751} \approx 15.94$$
Example – Page 794, #14.6
C) How many degrees of freedom does t have? Use Table C to approximate the P-value of t against the one-sided alternative Ha: β > 0.
df = n – 2 = 5 – 2 = 3; since t > 12.92, we know the P-value < 0.0005.
tcdf(15.9374, E99, 3) = 2.685 × 10⁻⁴
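The calculator's tcdf value can be cross-checked without a statistics library by integrating the t density numerically. This is only a sketch: the helper names `t_density` and `t_upper_tail` are hypothetical, and Simpson's rule is one of several reasonable ways to approximate the tail area.

```python
import math

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=20000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t].

    steps must be even. The area from 0 to t is subtracted from 0.5,
    which is the area to the right of 0 by symmetry.
    """
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    return 0.5 - total * h / 3

t_stat = 1.1969 / 0.0751            # slope divided by its standard error
p_value = t_upper_tail(t_stat, df=3)
print(f"t = {t_stat:.4f}, one-sided P-value = {p_value:.3e}")
```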
Example – Page 794, #14.6
D) Write a sentence to describe your conclusion about the slope of the true regression line.
There is very strong evidence that β > 0, that is, that the line is useful for predicting the length of the humerus given the length of the femur.
Example – Page 794, #14.6
E) Determine a 99% confidence interval for the true slope of the regression line.
$$b \pm t^* SE_b = 1.1969 \pm 5.841(0.0751) = (0.758, 1.636)$$
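The t statistic and the 99% interval for this example use only three numbers: the slope and its standard error from the printout, and the critical value t* = 5.841 for 3 degrees of freedom. A short sketch of the arithmetic:

```python
# Values from the "Femur" row of the S-PLUS output
b = 1.1969        # estimated slope
se_b = 0.0751     # standard error of the slope

# 99% critical value for the t distribution with 3 degrees of freedom,
# as used in the interval for this example
t_star = 5.841

t_stat = b / se_b                 # test statistic for H0: beta = 0
margin = t_star * se_b            # margin of error
ci = (b - margin, b + margin)     # confidence interval for the true slope

print(f"t = {t_stat:.2f}")
print(f"99% CI for the slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the interval lies entirely above zero, it is consistent with the conclusion in part D.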
Example – Page 794, #14.8
There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. Exercise 3.63 (page 183) gives data on yearly wine consumption (liters of alcohol from drinking wine, per person) and yearly deaths from heart disease (deaths per 100,000 people) in 19 developed nations.
A) Is there statistically significant evidence of a negative association between wine consumption and heart disease deaths? Carry out the appropriate test of significance and write a summary statement about your conclusions.
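Example – Page 794, #14.8
$$H_0\colon \beta = 0 \qquad H_a\colon \beta < 0$$
Here β is the slope of the true regression line of heart disease deaths on wine consumption; β < 0 corresponds to a negative association between wine consumption and heart disease deaths.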
Example – Page 794, #14.8
Linear Regression t-Test Conditions
1. The observations are independent.
2. The true relationship is linear (check the scatterplot to see that the overall pattern is linear, or check a plot of the residuals against the predicted values).
3. The standard deviation of the response about the true line is the same everywhere (make sure the spread around the line is nearly constant).
4. The response varies normally about the true regression line (the normal probability plot of the residuals is quite straight).
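Example – Page 794, #14.8
$$t = \frac{b}{SE_b} = \frac{-22.97}{3.557} \approx -6.46, \qquad SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$
With df = 19 – 2 = 17, the calculator gives P-value = 2.96 × 10⁻⁶; from Table C, P-value < 0.0005.
Reject H0, since the P-value < 0.0005 < α = 0.05, and conclude that there is a negative linear relationship between wine consumption and heart disease deaths.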
Example – Page 795, #14.10
Exercise 14.4 (page 788) presents data on the relationship between the speed of runners (x, in feet per second) and the number of steps y that they take in a second. Here is part of the Data Desk regression output for these data:
R squared = 99.8%
s = 0.0091 with 7 – 2 = 5 degrees of freedom
Variable Coefficient s.e. of Coeff t-ratio prob
Constant 1.76608 0.0307 57.6 <0.0001
speed 0.080284 0.0016 49.7 <0.0001
Example – Page 795, #14.10
A) How can you tell from this output, even without the scatterplot, that there is a very strong straight-line relationship between running speed and steps per second?
Example – Page 795, #14.10
r² is very close to 1, which means that nearly all the variation in steps per second is accounted for by running speed. Also, the P-value for the test of H0: β = 0 is very small.
Example – Page 795, #14.10
B) What parameter in the regression model gives the rate at
which steps per second increase as running speed
increases? Give a 99% confidence interval for this rate.
Example – Page 795, #14.10
β (the slope) is this rate; the estimate is listed as the coefficient of “speed,” 0.080284.
$$b \pm t^* SE_b = 0.080284 \pm 4.032(0.0016) = (0.074, 0.087)$$
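The standard error of the slope, the t-ratio, and the 99% interval can also be reproduced from the raw running data given in Exercise 14.4. This sketch assumes the critical value t* = 4.032 (5 degrees of freedom) quoted above; the variable names are illustrative.

```python
import math

# Running data: speed in feet per second (x) and steps per second (y)
speed = [15.86, 16.88, 17.50, 18.62, 19.97, 21.06, 22.11]
steps = [3.05, 3.12, 3.17, 3.25, 3.36, 3.46, 3.55]

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(steps) / n
sxx = sum((x - x_bar) ** 2 for x in speed)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, steps))

b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(speed, steps))
s = math.sqrt(sse / (n - 2))

se_b = s / math.sqrt(sxx)     # standard error of the slope
t_stat = b / se_b             # t-ratio for H0: beta = 0

t_star = 4.032                # 99% critical value, 5 degrees of freedom
lower = b - t_star * se_b
upper = b + t_star * se_b

print(f"SE_b = {se_b:.4f}, t = {t_stat:.1f}")
print(f"99% CI for the slope: ({lower:.3f}, {upper:.3f})")
```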
Lesson 14-2, Part 1: Predictions and Conditions
Confidence Intervals
• Write the given value of the explanatory variable x as x*.
▫ The distinction between predicting a single outcome and predicting the mean of all outcomes when x = x* determines what margin of error is correct.
• To estimate the mean response, µy = α + βx*, we use a confidence interval.
• To estimate an individual response y, we use a prediction interval.
Confidence Intervals for Regression Response
A level C confidence interval for the mean response µy when x takes the value x* is
$$\hat{y} \pm t^* SE_{\hat{\mu}}$$
The standard error is
$$SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$$
Prediction Intervals for Regression Response
A level C prediction interval for a single observation on y when x takes the value x* is
$$\hat{y} \pm t^* SE_{\hat{y}}$$
The standard error is
$$SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$$
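To see how the two standard errors compare, the sketch below applies both formulas to the Table 1 cholesterol data at x* = 67, the age used in the review prediction. The 95% critical value t* = 2.179 for 12 degrees of freedom is a standard t-table value, not a number from the slides, and the variable names are illustrative.

```python
import math

# Table 1 data: age (x) and total cholesterol (y)
age = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n
sxx = sum((x - x_bar) ** 2 for x in age)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))
b = sxy / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(age, chol)) / (n - 2))

x_star = 67
fit = a + b * x_star

# Shared term: 1/n + (x* - x_bar)^2 / sum (x - x_bar)^2
core = 1 / n + (x_star - x_bar) ** 2 / sxx
se_mean = s * math.sqrt(core)        # for the mean response
se_pred = s * math.sqrt(1 + core)    # for a single observation

t_star = 2.179   # assumed: 95% critical value, 12 df, from a t table
ci = (fit - t_star * se_mean, fit + t_star * se_mean)
pi = (fit - t_star * se_pred, fit + t_star * se_pred)

print(f"SE (mean response) = {se_mean:.2f}, SE (prediction) = {se_pred:.2f}")
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f}); 95% PI: ({pi[0]:.1f}, {pi[1]:.1f})")
```

The extra 1 under the square root is why the prediction interval is always wider than the confidence interval at the same x* and confidence level.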
Conditions for Regression Inference
• The observations are independent
• The true relationship is linear
• The standard deviation of the response about the true line is the same everywhere.
• The response varies normally about the true regression line.
• Check conditions using the residuals.
Examine the residual plot to check that the relationship is roughly linear and that the scatter about the line is the same from end to end.
Violations of the regression conditions: The variation of the residuals is not constant.
Violations of the regression conditions: There is a curved relationship between the response variable and the explanatory variable.
Example – Page 802, #14.12
A) The residuals for the crying and IQ data appear in Example 14.3 (page 785). Make a stemplot to display the distribution of the residuals. Are there outliers or signs of strong departures from normality?
19.20 31.13 22.65 15.18 12.18 15.15 16.63 6.18
1.70 22.60 6.68 6.17 9.15 23.58 9.14 2.80
9.14 1.66 6.14 12.60 0.34 8.62 2.85 14.30
9.82 10.82 0.37 8.85 10.87 19.34 10.89 2.55
20.85 24.35 18.94 32.89 18.47 51.32
Example – Page 802, #14.12
–3 | 1
–2 | 4 3 3
–1 | 0 8 5 5 3 2
–0 | 9 9 9 9 7 6 6 6 3 2 2 0
 0 | 0 3 3 9
 1 | 0 1 1 1 4 8 9 9
 2 | 1 4
 3 | 3
 4 |
 5 | 1
One residual (51.32) may be a high outlier, but the stemplot does not show any other deviations from normality.
Example – Page 802, #14.12
B) What other assumptions or conditions are required for
using inference for regression on these data? Check that
those conditions are satisfied and then describe your
findings.
Example – Page 802, #14.12
The scatter of the data points about the regression line varies to some extent as we move along the line, but the variation is not serious, as a residual plot shows. The other conditions can be assumed to be satisfied.
Example – Page 802, #14.12
C) Would a 95% prediction interval for x = 25 be narrower, the same size, or wider than a 95% confidence interval? Explain your reasoning.
A prediction interval would be wider. For a fixed
confidence level, the margin of error is always larger
when we are predicting a single observation than when
we are estimating the mean response.
Example – Page 802, #14.12
D) A computer package reports that the 95% prediction interval for x = 25 is (91.85, 165.33). Explain what this interval means in simple language.
We are 95% confident that when x (crying intensity) = 25, the corresponding individual value of y (IQ) will be between 91.85 and 165.33.
Example – Page 802, #14.14
In exercise 14.11 (page 795) we regressed the lean of the
Leaning Tower of Pisa on year to estimate the rate at which
the tower is tilting. Here are the residuals from that
regression, in order by years across the rows:
4.220 3.099 0.418 1.264 2.055 3.626 2.308
5.011 0.670 4.648 5.967 1.714 7.396
Use the residuals to check the regression conditions, and
describe your findings. Is the regression in exercise 14.11
trustworthy?
Example – Page 802, #14.14
[Figures: plot of the residuals versus year; normal probability plot of the residuals]
The plot of the residuals versus year does not suggest any problems. The regression in Exercise 14.11 should be fairly reliable.
Example – Page 809, #14.24
Here are data on the time (in minutes) Professor Moore takes to swim 2000 yards and his pulse rate (beats per minute) after swimming:
Time: 34.12 35.72 34.72 34.05 34.13 35.72 36.17 35.57
Pulse: 152 124 140 152 146 128 136 144
Time: 35.37 35.57 35.43 36.05 34.85 34.70 34.75 33.93
Pulse: 148 144 136 124 148 144 140 156
Time: 34.60 34.00 34.35 35.62 35.68 35.28 35.97
Pulse: 136 148 148 132 124 132 139
Example – Page 809, #14.24
A scatterplot shows a negative linear relationship: a faster time (fewer minutes) is associated with a higher heart rate. Here is part of the output from the regression function in the Excel spreadsheet.
Coefficients Standard Error t Stat P-value
Intercepts 479.9341457 66.22779275 7.246718119 3.87075E–07
X variable – 9.694903394 1.888664503 – 5.1332057 4.37908E–05
Give a 90% confidence interval for the slope of the true
regression line. Explain what your result tells us about the
relationship between the professor’s swimming time and
heart rate.
Example – Page 809, #14.24
$$b \pm t^*_{21} SE_b = -9.6949 \pm 1.721(1.8887)$$
–12.9454 to –6.4444 bpm per minute
With 90% confidence, we can say that for each 1-minute increase in swimming time, pulse rate drops by about 6 to 13 bpm on average.
Example – Page 809, #14.24
Using the TI
Example – Page 809, #14.25
Exercise 14.24 gives data on a swimmer’s time and heart rate. One day the swimmer completes his laps in 34.3 minutes but forgets to take his pulse. Minitab gives this prediction for heart rate when x* = 34.3:
Fit StDev Fit 90.0% CI 90.0% PI
147.40 1.97 (144.02, 150.78) (135.79, 159.01)
A) Verify that “Fit” is the predicted heart rate from the least-squares line found in exercise 14.24. Then choose one of the intervals from the output to estimate the swimmer’s heart rate that day and explain why you chose this interval.
Example – Page 809, #14.25
$$\hat{y}\ (\text{pulse}) = 479.9 - 9.6949x\ (\text{time})$$
When x = 34.3 minutes,
$$\hat{y}\ (\text{pulse}) = 479.9 - 9.6949(34.3) \approx 147.37$$
This agrees with the “Fit” value of 147.40 in the output, up to rounding.
Example – Page 809, #14.25
The prediction interval is appropriate for estimating one value (as opposed to the mean of many values): 135.79 to 159.01 bpm.
Example – Page 809, #14.25
B) Minitab gives only one of the two standard errors used in prediction. It is the standard error $SE_{\hat{\mu}}$ for estimating the mean response. Use this fact and a critical value from Table C to verify Minitab’s 90% confidence interval for the mean heart rate on days when the swimming time is 34.3 minutes.
Example – Page 809, #14.25
$$\hat{y} \pm t^*_{21} SE_{\hat{\mu}} = 147.40 \pm 1.721(1.97)$$
144.01 to 150.79, which agrees with the computer output.
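The same check can be scripted. The sketch below verifies the confidence interval from “Fit” and “StDev Fit”, then uses the reported prediction interval to back out the prediction standard error. Since the two standard-error formulas differ only by the extra 1 under the square root, SE(prediction)² = s² + SE(mean)², which also gives an implied value of s; that implied value is a by-product of this sketch, not a number reported on the slides.

```python
import math

# Values from the Minitab output at x* = 34.3
fit = 147.40                       # "Fit"
se_mean = 1.97                     # "StDev Fit": SE for the mean response
pi_low, pi_high = 135.79, 159.01   # reported 90% prediction interval

t_star = 1.721                     # 90% critical value, 21 degrees of freedom

# 90% confidence interval for the mean response
ci = (fit - t_star * se_mean, fit + t_star * se_mean)

# Back out the prediction standard error from the PI half-width
se_pred = (pi_high - pi_low) / 2 / t_star

# SE_pred^2 = s^2 + SE_mean^2, so the regression standard error s follows
s_implied = math.sqrt(se_pred ** 2 - se_mean ** 2)

print(f"90% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"SE for prediction = {se_pred:.2f}, implied s = {s_implied:.2f}")
```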