Simple Linear Reg


Preview

In Chapter 4, you learned how to describe relationships between two numerical variables. When the relationship was judged to be linear, you found the equation of the least squares regression line and assessed the quality of the fit using the scatterplot, the residual plot, and the values of the coefficient of determination ($r^2$) and the standard deviation about the least squares line ($s_e$). In this chapter you will learn how to make inferences about the slope of the population regression line.

Understanding Relationships—Numerical Data Part 2

Chapter 16
Preview
Chapter Learning Objectives
16.1 The Simple Linear Regression Model
16.2 Inferences Concerning the Slope of the Population Regression Line
16.3 Checking Model Adequacy
Are You Ready to Move On? Chapter 16 Review Exercises
Technology Notes
AP* Review Questions for Chapter 16


Section V Additional Opportunities to Learn from Data


Chapter Learning Objectives

Conceptual Understanding
After completing this chapter, you should be able to
C1 Understand how probabilistic and deterministic models differ.
C2 Understand that the simple linear regression model provides a basis for making inferences about linear relationships.

Mastering the Mechanics
After completing this chapter, you should be able to
M1 Interpret the parameters of the simple linear regression model in context.
M2 Use scatterplots, residual plots, and normal probability plots to assess the credibility of the assumptions of the simple linear regression model.
M3 Know the conditions for appropriate use of methods for making inferences about $\beta$.
M4 Compute the margin of error when the sample slope b is used to estimate a population slope $\beta$.
M5 Use the five-step process for estimation problems (EMC3) and computer output to construct and interpret a confidence interval estimate for the slope of a population regression line.
M6 Use the five-step process (HMC3) to test hypotheses about the slope of the population regression line.
M7 Use graphs to identify potential outliers and influential points.

Putting It into Practice
After completing this chapter, you should be able to
P1 Interpret a confidence interval for a population slope in context.
P2 Carry out the model utility test and interpret the result in context.

Preview Example Premature Babies

Babies born prematurely (before the 37th week of pregnancy) often have low birth weights. Is a low birth weight related to factors that affect brain function? The authors of the paper “Intrauterine Growth Restriction Affects the Preterm Infant’s Hippocampus” (Pediatric Research [2008]: 438-43) hoped to use data from a study of premature babies to answer this question. They measured x = birth weight (in grams) and y = hippocampus volume (in mL) for 26 premature babies. The hippocampus is a part of the brain that is important in the development of both short- and long-term memory. The sample correlation coefficient for their data is r = 0.4722 and the equation of the least squares regression line is $\hat{y} = 1.67 + 0.0026x$.

The pattern in the scatterplot (Figure 16.1) suggests there may be a positive linear relationship. However, the correlation coefficient is not very large, and the value of the slope is close to zero. Could the pattern observed in the scatterplot, and the nonzero slope, be plausibly explained by chance? That is, is it plausible that there is no relationship between birth weight and hippocampus volume in the population of all premature babies? Or does the sample provide convincing evidence of a linear relationship between these two variables?

If there is evidence of a meaningful relationship between these two variables, the regression line could be used to predict the hippocampus volume. If the predicted volume was sufficiently small, early cognitive therapy could be recommended. On the other hand, if there is no meaningful relationship between these variables, low birth weight should not automatically trigger potentially expensive therapy.


In this chapter, you will learn methods that will help you determine if there is a real and useful linear relationship between two variables or if the pattern in the data could be simply due to chance differences that occur when a sample is selected from a population.

Figure 16.1 Scatterplot of birth weight versus hippocampus volume.

Section 16.1 The Simple Linear Regression Model

A deterministic relationship between two variables x and y is one in which the value of y is completely determined by the value of the independent variable x. A deterministic relationship can be described, or “modeled,” using mathematical notation, such as y = f(x), where f(x) is a particular function of x. This relationship is deterministic in the sense that the value of the independent variable is all that is needed to determine the value of the dependent variable. For example, you might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using y = f(x), where

$$f(x) = \frac{9}{5}x + 32$$

Once the centigrade temperature is known, the Fahrenheit temperature is completely determined. Or you might determine y = amount of money in a savings account after x years, using the compound interest formula

$$y = P\left(1 + \frac{r}{n}\right)^{nx}$$

where P is the principal (the amount of money deposited), r is the interest rate, and n is the number of times each year the interest is compounded. The number of years you leave the principal in the bank determines the amount in the account.

In many situations the variables of interest are not deterministically related. For example, the value of y = first-year college grade point average is not determined solely by x = high school grade point average, and y = crop yield is determined partly by factors other than x = amount of fertilizer used. The relationship between two variables, x and y, that are not deterministically related can be described by extending the deterministic model to specify a probabilistic model. The general form of a probabilistic model allows y to be larger or smaller than f(x) by a random amount e. The model equation for a probabilistic model has the form

$$y = \text{deterministic function of } x + \text{random deviation} = f(x) + e$$

In a scatterplot of y versus x, some of the data points will fall above the graph of f(x) and some will fall below. Thinking geometrically, if e > 0, the corresponding point in the scatterplot will lie above the graph of the function y = f(x). If e < 0, the corresponding point will fall below the graph of f(x).

For example, consider the probabilistic model

$$y = \underbrace{50 - 10x + x^2}_{f(x)} + e$$

The graph of the function $y = 50 - 10x + x^2$ is shown as the orange curve in Figure 16.2. The observed point (4, 30) is also shown in the figure. Because $f(4) = 50 - 10(4) + 4^2 = 50 - 40 + 16 = 26$ for this point, you can write y = f(x) + e, where e = 4. The point (4, 30) falls 4 above the graph of the function $y = 50 - 10x + x^2$.

Figure 16.2 A deviation from the deterministic part of a probabilistic model. (The figure shows the graph of $y = 50 - 10x + x^2$ and the observation (4, 30), which lies e = 4 above the curve height of 26 at x = 4.)

Simple Linear Regression Model

The simple linear regression model is a special case of the general probabilistic model in which the deterministic function, f(x), is linear (so its graph is a straight line).

Definition
The simple linear regression model assumes that there is a line with vertical or y intercept $\alpha$ and slope $\beta$, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$$y = \alpha + \beta x + e$$

Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation recognizes that points will deviate from the line by a random amount.

Figure 16.3 shows two observations in relation to the population regression line.

Figure 16.3 Two observations and deviations from the population regression line. (The figure marks an observation at $x = x_1$ with positive deviation $e_1$, an observation at $x = x_2$ with negative deviation $e_2$, the population regression line with slope $\beta$, and the vertical intercept $\alpha$.)

Unless otherwise noted, all content on this page is © Cengage Learning.
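The arithmetic in the quadratic example above can be sketched in a few lines: the deviation e for an observed point is the observed y minus the height of the curve f(x) at that x. The function names `f` and `deviation` are hypothetical, chosen only for this illustration.

```python
def f(x):
    """Deterministic part of the example model: f(x) = 50 - 10x + x^2."""
    return 50 - 10 * x + x ** 2


def deviation(x, y):
    """Random deviation e = y - f(x) for an observed point (x, y)."""
    return y - f(x)


# Observed point (4, 30) from the example in the text.
e = deviation(4, 30)
print(f(4), e)  # 26 4
```

A positive result means the point lies above the curve; a negative result would mean it lies below.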




Before you actually observe a value of y for any particular value of x, you are uncertain about the value of e. It could be negative, positive, or even 0. Also, e might be quite large in magnitude (resulting in a point far from the population regression line) or quite small (resulting in a point very close to the line). The simple linear regression model makes some assumptions about the distribution of e at any particular x value in the population.

Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$.
2. The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by $\sigma_e$.
3. The distribution of e at any particular x value is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

The simple linear regression model assumptions about the variability in the values of e in the population imply that there is also variability in the y values observed at any particular value of x. Consider y when x has some fixed value x*, so that

$$y = \alpha + \beta x^* + e$$

Because $\alpha$ and $\beta$ are fixed (they are unknown population values), $\alpha + \beta x^*$ is also a fixed number. The sum of a fixed number and a normally distributed variable (e) is also a normally distributed variable (the bell-shaped curve is simply shifted), so y itself has a normal distribution. Furthermore, $\mu_e = 0$ implies that the mean value of y is $\alpha + \beta x^*$, the height of the population regression line for the value x = x*. Finally, because there is no variability in the fixed number $\alpha + \beta x^*$, the standard deviation of y is the same as the standard deviation of e. These properties are summarized in the following box.

At any fixed value x*, y has a normal distribution, with

(mean y value for x*) = (height of the population regression line above x*) = $\alpha + \beta x^*$

and

(standard deviation of y for a fixed value x*) = $\sigma_e$

The slope $\beta$ of the population regression line is the mean or expected change in y associated with a 1-unit increase in x. The y intercept $\alpha$ is the height of the population line when x = 0.

The value of $\sigma_e$ determines how much the (x, y) observations deviate vertically from the population line; when $\sigma_e$ is small, most observations will be close to the line, but when $\sigma_e$ is large, the observations will tend to deviate more from the line.

The key features of the model are illustrated in Figures 16.4 and 16.5. Notice that the three normal curves in Figure 16.4 have identical spreads. This is a consequence of $\sigma_e$ being the same at any value of x, which implies that the variability in the y values at a particular value of x is constant: the variability does not depend on the value of x.
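These properties can be checked with a small simulation. The sketch below draws many y values at a few fixed x values from the model y = α + βx + e with normally distributed e. The parameter values (α = 2.0, β = 0.5, σ_e = 1.0), the x values, and the helper name `simulate_y` are hypothetical choices made only for this illustration; they do not come from the text.

```python
import random
import statistics

ALPHA, BETA, SIGMA_E = 2.0, 0.5, 1.0  # hypothetical population values


def simulate_y(x, n, rng):
    """Draw n observations of y = ALPHA + BETA*x + e, with e ~ Normal(0, SIGMA_E)."""
    return [ALPHA + BETA * x + rng.gauss(0.0, SIGMA_E) for _ in range(n)]


rng = random.Random(1)  # fixed seed so the run is repeatable
results = {}
for x in (1.0, 5.0, 10.0):
    ys = simulate_y(x, 20000, rng)
    results[x] = (statistics.mean(ys), statistics.stdev(ys))
    print(f"x = {x:4.1f}: line height = {ALPHA + BETA * x:.2f}, "
          f"sample mean = {results[x][0]:.2f}, sample sd = {results[x][1]:.2f}")
```

In each subpopulation the sample mean should land close to the height of the line at that x, and the sample standard deviation should land close to σ_e regardless of x.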



Figure 16.4 Illustration of the simple linear regression model. (At each of three different x values, $x_1$, $x_2$, and $x_3$, the y values follow a normal curve with mean value $\alpha + \beta x_i$ and standard deviation $\sigma$. The means lie on the population regression line $y = \alpha + \beta x$, the line of mean values.)

Figure 16.5 The simple linear regression model: (a) small $\sigma_e$; (b) large $\sigma_e$.

Example 16.1 Stand on Your Head to Lose Weight?

The authors of the article “On Weight Loss by Wrestlers Who Have Been Standing on Their Heads” (paper presented at the Sixth International Conference on Statistics, Combinatorics, and Related Areas, Forum for Interdisciplinary Mathematics, 1999, with the data also appearing in A Quick Course in Statistical Process Control, Mick Norton, 2005) state that “amateur wrestlers who are overweight near the end of the weight certification period, but just barely so, have been known to stand on their heads for a minute or two, get on their feet, step back on the scale, and establish that they are in the desired weight class. Using a headstand as the method of last resort has become a fairly common practice in amateur wrestling.”

Does this really work? Data were collected in an experiment where weight loss was recorded for each wrestler after exercising for 15 minutes and then doing a headstand for 1 minute 45 sec. Based on these data, the authors of the article concluded that there was in fact a demonstrable weight loss that was greater than that for a control group that exercised for 15 minutes but did not do the headstand. (The authors give a plausible explanation for why this might be the case based on the way blood and other body fluids collect in the head during the headstand and the effect of weighing while these fluids are draining immediately after standing.) The authors also concluded that a simple linear regression model was a reasonable way to describe the relationship between the variables

y = weight loss (in pounds)

and

x = body weight prior to exercise and headstand (in pounds)


Suppose that the actual model equation has $\alpha = 0$, $\beta = 0.001$, and $\sigma_e = 0.09$ (these values are consistent with the findings in the article). The population regression line is shown in Figure 16.6.

Figure 16.6 The population regression line for Example 16.1. (The figure shows the population regression line y = 0.001x and marks the mean y value of 0.19 when x = 190.)

If the distribution of the random errors at any fixed weight (x value) is normal, then the variable y = weight loss is normally distributed with

$$\mu_y = 0 + 0.001x \qquad \sigma_y = 0.09$$

For example, when x = 190 (corresponding to a 190-pound wrestler), weight loss has mean value

$$\mu_y = 0 + 0.001(190) = 0.19 \text{ pounds}$$

Because the standard deviation of y is $\sigma_y = 0.09$, the interval $0.19 \pm 2(0.09) = (0.01, 0.37)$ includes y values that are within 2 standard deviations of the mean value for y when x = 190. Roughly 95% of the weight loss observations made for 190-lb wrestlers will be in this range. The slope $\beta = 0.001$ can be interpreted as the mean change in weight loss associated with each additional pound of body weight.
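The "roughly 95%" figure can be checked directly from the normal distribution. The following is a minimal sketch using Python's standard-library `statistics.NormalDist` with the parameter values assumed in Example 16.1; the variable names are illustrative only.

```python
from statistics import NormalDist

# Population values assumed in Example 16.1
alpha, beta, sigma_e = 0.0, 0.001, 0.09

x = 190                                  # body weight in pounds
mean_y = alpha + beta * x                # height of the population line at x
low = mean_y - 2 * sigma_e               # 2 standard deviations below the mean
high = mean_y + 2 * sigma_e              # 2 standard deviations above the mean

# Under the model, y at this x is normal with mean mean_y and sd sigma_e.
dist = NormalDist(mu=mean_y, sigma=sigma_e)
prob = dist.cdf(high) - dist.cdf(low)

print(round(mean_y, 2), round(low, 2), round(high, 2), round(prob, 4))
# 0.19 0.01 0.37 0.9545
```

The probability of landing within 2 standard deviations of the mean is about 0.9545 for any normal distribution, which is where the "roughly 95%" statement comes from.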

More insight into model properties can be gained by thinking of the population of all (x, y) pairs as consisting of many smaller subpopulations. Each subpopulation contains pairs for which x has a fixed value. Suppose, for example, that in a large population of college students the variables

x = grade point average in major courses

and

y = starting salary after graduation

are related according to the simple linear regression model. Then you can think about the subpopulation of all pairs with x = 3.20 (corresponding to all students with a grade point average of 3.20 in major courses), the subpopulation of all pairs having x = 2.75, and so on. The model assumes that for each of these subpopulations, y is normally distributed with the same standard deviation, and that the mean y value (rather than y itself) is linearly related to x.

In practice, the judgment of whether the simple linear regression model is appropriate (that is, the judgment about the credibility of the assumptions underlying the linear model) must be based on knowledge of how the data were collected, as well as an inspection of various plots of the data and the residuals. The sample observations should be independent of one another, which will be the case if the data are from a random sample. In addition, the scatterplot should show a linear rather than a curved pattern, and the vertical spread of points should be very similar throughout the range of x values. Figure 16.7 shows plots with three different patterns; only the first pattern is consistent with the simple linear regression model assumptions.




Figure 16.7 Some commonly encountered patterns in scatterplots: (a) Consistent with the simple linear regression model; (b) Suggests a nonlinear probabilistic model; (c) Suggests that variability in y changes with x.

Estimating the Population Regression Line

In Section 16.3, you will see how to check whether the basic assumptions of the simple linear regression model are reasonable. When this is the case, the values of $\alpha$ and $\beta$ (y intercept and slope of the population regression line) can be estimated from sample data.

The estimates of $\alpha$ and $\beta$ are denoted by a and b, respectively. These estimates are the values of the intercept and slope of the least squares regression line. Recall that the least squares regression line is the line for which the sum of squared vertical deviations of points in the scatterplot from the line is smaller than for any other line.

The estimates of the slope and the y intercept of the population regression line are the slope and y intercept, respectively, of the least squares line. That is,

$$b = \text{estimate of } \beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

$$a = \text{estimate of } \alpha = \bar{y} - b\bar{x}$$

The values of a and b are usually obtained using statistical software or a graphing calculator. If the slope and intercept are calculated by hand, you can use the following computational formula:

$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$

The estimated regression line is the familiar least squares line

$$\hat{y} = a + bx$$

Let x* denote a specified value of the independent variable x. Then a + bx* has two different interpretations:

1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.

Example 16.2 Mother’s Age and Baby’s Birth Weight

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. (Low birth weight in humans is generally defined as a weight below 2,500 grams.) Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers.

One such study is described in the article “Body Size and Intelligence in 6-Year-Olds: Are Offspring of Teenage Mothers at Risk?” (Maternal and Child Health Journal [2009]: 847-856). The following data on

x = maternal age (in years)

and

y = birth weight of baby (in grams)

are consistent with summary values given in the article and also with data published by the National Center for Health Statistics.

Observation      1      2      3      4      5      6      7      8      9     10
x               15     17     18     15     16     19     17     16     18     19
y            2,289  3,393  3,271  2,648  2,897  3,327  2,970  2,535  3,138  3,573

A scatterplot of the data is given in Figure 16.8. The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Figure 16.8 Scatterplot of birth weight versus maternal age for Example 16.2. (Horizontal axis: mother’s age (yr); vertical axis: baby’s weight (g).)

For these data, the equation of the estimated regression line was found using statistical software, resulting in

$$\hat{y} = a + bx = -1{,}163.45 + 245.15x$$

An estimate of the mean birth weight of babies born to 18-year-old mothers results from substituting x = 18 into the estimated equation:

$$\text{estimated mean } y \text{ for 18-year-old mothers} = a + bx = -1{,}163.45 + 245.15(18) = 3{,}249.25 \text{ grams}$$

Similarly, you would predict the birth weight of a baby to be born to a particular 18-year-old mother to be

$$\hat{y} = \text{predicted } y \text{ value when } x = 18 = a + b(18) = 3{,}249.25 \text{ grams}$$

The estimate of the mean weight and the prediction of an individual baby weight are identical, because the same x value was used in each calculation. However, their interpretations differ. One is the prediction of the weight of a single baby whose mother is 18, whereas the other is an estimate of the mean weight of all babies born to 18-year-old mothers.
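The estimates in Example 16.2 can be reproduced by hand from the definitional formulas for b and a. The sketch below does this in plain Python rather than with statistical software; the function name `least_squares` is a hypothetical helper introduced only for this illustration.

```python
# Data from Example 16.2
x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                      # maternal age (years)
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]  # birth weight (grams)


def least_squares(x, y):
    """Return (a, b): the intercept and slope of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b


a, b = least_squares(x, y)
y_hat_18 = a + b * 18        # estimated mean / predicted value at x = 18
print(round(a, 2), round(b, 2), round(y_hat_18, 2))  # -1163.45 245.15 3249.25
```

The same number, 3,249.25 grams, serves as both the point estimate of the mean and the point prediction for an individual; only the interpretation changes.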

In Example 16.2, the x values in the sample ranged from 15 to 19. The estimated regression equation should not be used to make an estimate or prediction for any x value much outside this range. Without sample data for such values, or some clear theoretical reason for expecting the relationship to be linear outside the observed range of x values, you have no reason to believe that the estimated linear relationship continues outside the range from 15 to 19. Making predictions outside this range can be misleading, and statisticians refer to this as the danger of extrapolation.




Estimating $\sigma_e^2$ and $\sigma_e$

The value of $\sigma_e$ determines the extent to which observed points (x, y) tend to fall close to or far away from the population regression line. A point estimate of $\sigma_e$ is based on

$$\text{SSResid} = \sum (y - \hat{y})^2$$

where $\hat{y}_1 = a + bx_1, \ldots, \hat{y}_n = a + bx_n$ are the fitted or predicted y values and the residuals are $y_1 - \hat{y}_1, \ldots, y_n - \hat{y}_n$. SSResid is a measure of the extent to which the sample data spread out around the estimated regression line.

Definition
The statistic for estimating the variance $\sigma_e^2$ is

$$s_e^2 = \frac{\text{SSResid}}{n - 2}$$

where

$$\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy$$

The subscript in $\sigma_e^2$ and $s_e^2$ is a reminder that you are estimating the variance of the “errors” or residuals.

The estimate of $\sigma_e$ is the estimated standard deviation

$$s_e = \sqrt{s_e^2}$$

The number of degrees of freedom associated with estimating $\sigma_e^2$ or $\sigma_e$ in simple linear regression is n − 2.

The estimates and number of degrees of freedom here have analogs in previous work involving a single sample $x_1, x_2, \ldots, x_n$. The sample variance $s^2$ had a numerator of $\sum (x - \bar{x})^2$, a sum of squared deviations (residuals), and denominator n − 1, the number of degrees of freedom associated with $s^2$ and s. The use of $\bar{x}$ as an estimate of $\mu$ in the formula for $s^2$ reduces the number of degrees of freedom by 1, from n to n − 1. In simple linear regression, estimation of two quantities, $\alpha$ and $\beta$, results in a loss of 2 degrees of freedom, leaving n − 2 as the number of degrees of freedom associated with SSResid, $s_e^2$, and $s_e$.

Once the estimated regression equation has been found, the usefulness of this model is evaluated using a residual plot and the values of $s_e$ and the coefficient of determination, $r^2$. Recall from Chapter 4 that the values of $s_e$ and $r^2$ are interpreted as described in the following box.

The coefficient of determination, $r^2$, is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.

The value of $s_e$, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line.
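As a sketch of how these quantities are computed, the code below applies the definitions of SSResid, $s_e^2$, $s_e$, and $r^2$ to the Example 16.2 data (maternal age and birth weight), using the least squares estimates a = −1,163.45 and b = 245.15 reported in that example. The resulting values are not stated in the text; they are simply what the formulas give for these ten observations. Variable names are illustrative.

```python
import math

# Data and least squares estimates from Example 16.2
x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
a, b = -1163.45, 245.15
n = len(x)

# Residuals: observed y minus predicted y on the estimated line
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
ss_resid = sum(r ** 2 for r in residuals)

# Estimate of the error variance uses n - 2 degrees of freedom
df = n - 2
s_e_sq = ss_resid / df
s_e = math.sqrt(s_e_sq)

# Coefficient of determination: 1 - SSResid / SSTo
y_bar = sum(y) / n
ss_total = sum((yi - y_bar) ** 2 for yi in y)
r_sq = 1 - ss_resid / ss_total

print(round(ss_resid, 2), df, round(s_e, 2), round(r_sq, 3))
# 337212.45 8 205.31 0.781
```

Read in context, a typical birth weight in this sample deviates from the estimated line by roughly 205 grams, and about 78% of the variability in birth weight is explained by the approximate linear relationship with maternal age.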

Example 16.3 Estimating Elk Weight

Wildlife biologists monitor the ecological health of animals. For large animals whose habitat is relatively inaccessible, this can present some practical problems. The Rocky Mountain elk is the fourth largest deer species and is a case in point. Males range up to 7.5 feet in length and over 500 pounds in weight. The equipment, manpower, and time needed to weigh these creatures make direct measurement of weight difficult and expensive. The authors of the paper “Estimating Elk Weight From Chest Girth” (Wildlife Society Bulletin [1996]: 58-611) found they could reliably estimate elk weights by a much more practical method: measuring the chest girth and then using linear regression to estimate the weight. They measured the chest girth and weight of 19 Rocky Mountain elk in Custer State Park, South Dakota. The resulting data (from a scatterplot in the paper) are given in the accompanying table. The table also includes the predicted values and residuals for the estimated regression line.

Girth (cm)   Weight (kg)   Predicted y Value   Residual
    96           87            136.266         -38.2661
   105          196            161.069          34.9314
   108          163            169.336          -6.3361
   109          196            172.092          23.9080
   110          183            174.848           8.1522
   114          171            185.871         -14.8711
   121          230            205.162          24.8380
   124          225            213.429          11.5705
   131          211            232.720         -21.7203
   135          231            243.744         -12.7436
   137          225            249.255         -24.2553
   138          266            252.011          13.9889
   140          241            257.523         -16.5228
   142          264            263.034           0.9655
   157          284            304.372         -20.3720
   157          292            304.372         -12.3720
   159          300            309.884          -9.8837
   155          337            298.860          38.1397
   162          339            318.151          20.8488

The scatterplot (Figure 16.9) gives evidence of a strong positive linear relationship between

x = chest girth (in cm)

and

y = weight (in kg)

Figure 16.9 Scatterplot of weight versus chest girth for Example 16.3. (Horizontal axis: girth (cm); vertical axis: weight (kg).)

Partial Minitab regression output is shown here.

    Regression Analysis: Weight versus Girth

    The regression equation is
    Weight = -136 + 2.81 Girth

    Predictor      Coef   SE Coef       T       P
    Constant    -135.51     35.75   -3.79   0.001
    Girth        2.8063    0.2686   10.45   0.000

    S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%

From the output,

$$\hat{y} = -136 + 2.81x \qquad r^2 = 0.865 \qquad s_e = 23.6626$$

Approximately 86.5% of the observed variation in elk weight y can be attributed to the linear relationship between weight and chest girth. The magnitude of a typical deviation from the least-squares line is about 23.6626 kg, which is relatively small in comparison to the y values themselves.

Another important assumption of the simple linear regression model is that the random deviations at any particular x value are normally distributed. In Section 16.3, you will see how the residuals can be used to determine whether this assumption is plausible.

Section 16.1 Exercises

Each exercise set assesses the following chapter learning objectives: C1, M1

Section 16.1 Exercise Set 1

16.1 Identify the following relationships as deterministic or probabilistic:
a. The relationship between the length of the sides of a square and its perimeter.
b. The relationship between the height and weight of an adult.
c. The relationship between SAT score and college freshman GPA.
d. The relationship between tree height in centimeters and tree height in inches.

16.2 Let x be the size of a house (in square feet) and y be the amount of natural gas used (therms) during a specified period. Suppose that for a particular community, x and y are related according to the simple linear regression model with

$\beta$ = slope of population regression line = 0.017

$\alpha$ = y intercept of population regression line = −5.0

Houses in this community range in size from 1000 to 3000 square feet.
a. What is the equation of the population regression line?
b. Graph the population regression line by first finding the point on the line corresponding to x = 1000 and then the point corresponding to x = 2000, and drawing a line through these points.
c. What is the mean value of gas usage for houses with 2100 sq. ft. of space?
d. What is the average change in usage associated with a 1 sq. ft. increase in size?
e. What is the average change in usage associated with a 100 sq. ft. increase in size?
f. Would you use the model to predict mean usage for a 500 sq. ft. house? Why or why not?

16.3 Suppose that a simple linear regression model is appropriate for describing the relationship between y = house price (in dollars) and x = house size (in square feet) for houses in a large city. The population regression line is y = 23,000 + 47x and $\sigma_e$ = 5000.
a. What is the average change in price associated with one extra square foot of space? With an additional 100 sq. ft. of space?
b. Approximately what proportion of 1800 sq. ft. homes would be priced over $110,000? Under $100,000?

16.4 Identify the following relationships as deterministic or probabilistic:
a. The relationship between height at birth and height at one year of age.
b. The relationship between a positive number and its square root.
c. The relationship between temperature in degrees Fahrenheit and degrees centigrade.
d. The relationship between adult shoe size and shirt size.

16.5 The flow rate in a device used for air quality measurement depends on the pressure drop x (inches of water) across the device’s filter. Suppose that for x values between 5 and 20, these two variables are related according to the simple linear regression model with population regression line y = −0.12 + 0.095x.
a. What is the mean flow rate for a pressure drop of 10 inches? A drop of 15 inches?
b. What is the average change in flow rate associated with a 1 inch increase in pressure drop? Explain.

16.6 The paper “Predicting Yolk Height, Yolk Width, Albumen Length, Eggshell Weight, Egg Shape Index, Eggshell Thickness, Egg Surface Area of Japanese Quails Using Various Egg Traits as Regressors” (International Journal of Poultry Science [2008]: 85–88) suggests that the simple linear regression model is reasonable for describing the relationship between y = eggshell thickness (in micrometers) and x = egg length (mm) for quail eggs. Suppose that the population regression line is y = 0.135 + 0.003x and that $\sigma_e$ = 0.005. Then, for a fixed x value, y has a normal distribution with mean 0.135 + 0.003x and standard deviation 0.005.
a. What is the mean eggshell thickness for quail eggs that are 15 mm in length? For quail eggs that are 17 mm in length?
b. What is the probability that a quail egg with a length of 15 mm will have a shell thickness that is greater than 0.18 mm?
c. Approximately what proportion of quail eggs of length 14 mm has a shell thickness of greater than 0.175? Less than 0.178?

Additional Exercises

16.7 Tom and Ray are managers of electronics stores with slightly different pricing strategies for USB drives. In Tom’s store, customers pay the same amount, c, for each USB drive. In Ray’s store, it is a little more exciting. The customer pays an up-front cost of $1.00. Ray charges the same price per USB drive, c, but at the register the customer flips a coin. If the coin lands heads up, the customer gets his or her $1.00 back, plus another dollar off the total cost of the USB drives purchased.
a. Which of these pricing strategies can be expressed as a deterministic model?
b. Using mathematical notation, specify a model using Tom’s pricing strategy that relates y = total cost to x = number of USB drives purchased.
c. Using mathematical notation, specify a model using Ray’s pricing strategy that relates y = total cost to x = number of USB drives purchased.
d. Describe the distribution of e for the probabilistic model described above. What is the mean of the distribution of e? What is the standard deviation of e?

16.8 Identify the following relationships as deterministic or probabilistic:
a. The relationship between the speed limit and a driver’s speed.
b. The relationship between the price in dollars and the price in Euros of an object.
c. The relationship between the number of pages and the number of words in a text book.
d. The relationship between the possible numbers of pennies and the nickels in a pile if no other coins are in the pile and the amount of money in the pile is $3.00.

16.9 Hormone replacement therapy (HRT) is thought to increase the risk of breast cancer. The accompanying data on x = percent of women using HRT and y = breast cancer incidence (cases per 100,000 women) for a region in Germany for 5 years appeared in the paper “Decline in Breast Cancer Incidence after Decrease in Utilisation of Hormone Replacement Therapy” (Epidemiology [2008]: 427–430). The authors of the paper used a simple linear regression model to describe the relationship between HRT use and breast cancer incidence.

HRT Use   Breast Cancer Incidence
 46.30          103.30
 40.60          105.00
 39.50          100.00
 36.60           93.80
 30.00           83.50

a. What is the equation of the estimated regression line?
b. What is the estimated average change in breast cancer incidence associated with a 1 percentage point increase in HRT use?
c. What would you predict the breast cancer incidence to be in a year when HRT use was 40%?
d. Should you use this regression model to predict breast cancer incidence for a year when HRT use was 20%? Explain.
e. Calculate and interpret the value of $r^2$.
f. Calculate and interpret the value of $s_e$.

16.10 Consider the accompanying data on x = advertising share and y = market share for a particular brand of soft drink during 10 randomly selected years.

x   0.103  0.072  0.071  0.077  0.086  0.047  0.060  0.050  0.070  0.052
y   0.135  0.125  0.120  0.086  0.079  0.076  0.065  0.059  0.051  0.039

a. Construct a scatterplot for these data. Do you think the simple linear regression model would be appropriate for describing the relationship between x and y?

b. Calculate the equation of the estimated regression line and use it to obtain the predicted market share when the advertising share is 0.09.

c. Compute $r^2$. How would you interpret this value?

d. Calculate a point estimate of $\sigma_e$. How many degrees of freedom are associated with this estimate?

16.11 The authors of the paper “Weight-Bearing Activity During Youth Is a More Important Factor for Peak Bone Mass than Calcium Intake” (Journal of Bone and Mineral Research [1994], 1089–1096) studied a number of variables they thought might be related to bone mineral density (BMD). The accompanying data on x = weight at age 13 and y = bone mineral density at age 27 are consistent with summary quantities for women given in the paper.

Weight (kg)    BMD (g/cm2)
54.4           1.15
59.3           1.26
74.6           1.42
62.0           1.06
73.7           1.44
70.8           1.02
66.8           1.26
66.7           1.35
64.7           1.02
71.8           0.91
69.7           1.28
64.7           1.17
62.1           1.12
68.5           1.24
58.3           1.00

The accompanying computer output is from JMP.

[Scatterplot with fitted line: BMD (g/cm^2) versus Weight (kg)]

Linear Fit
BMD (g/cm^2) = 0.5584011 + 0.0094363*Weight (kg)

Summary of Fit
RSquare                       0.121081
RSquare Adj                   0.053472
Root Mean Square Error        0.155141
Mean of Response              1.18
Observations (or Sum Wgts)    15

Parameter Estimates
Term           Estimate     Std Error   t Ratio   Prob>|t|
Intercept      0.5584011    0.466212    1.20      0.2524
Weight (kg)    0.0094363    0.007051    1.34      0.2038

a. What percentage of observed variation in BMD at age 27 can be explained by the simple linear regression model?

b. Give a point estimate of $\sigma_e$ and interpret this estimate.

c. Give an estimate of the average change in BMD associated with a 1 kg increase in weight at age 13.

d. Compute a point estimate of the mean BMD at age 27 for women whose age 13 weight was 60 kg.

16.12 The production of pups and their survival are the most significant factors contributing to gray wolf population growth. The causes of early pup mortality are unknown and difficult to observe. The pups are concealed within their dens for 3 weeks after birth, and after they emerge it is difficult to confirm their parentage. Researchers recently used portable ultrasound equipment to investigate some factors related to reproduction (“Diagnosing Pregnancy, In Utero Litter Size, and Fetal Growth with Ultrasound in Wild, Free-Ranging Wolves,” Journal of Mammalogy [2006]: 85–92). A scatterplot of y = length of an embryonic sac diameter (in cm) and x = gestational age (in days) is shown below. Computer output from a regression analysis is also given.

Bivariate Fit of Emb Ves Diam (cm) By Gest Age (days)

[Scatterplot with fitted line: Emb Ves Diam (cm) versus Gest Age (days)]

Linear Fit
Emb Ves Diam (cm) = –3.497279 + 0.1903121*Gest Age (days)

Summary of Fit
RSquare                       0.792803
RSquare Adj                   0.780615
Root Mean Square Error        0.450587
Mean of Response              2.482526
Observations (or Sum Wgts)    19

Parameter Estimates
Term               Estimate     Std Error   t Ratio   Prob>|t|
Intercept          –3.497279    0.748605    –4.67     0.0002*
Gest Age (days)    0.1903121    0.023597    8.07      <.0001*

a. What is the equation of the estimated regression line?

b. What is the estimated embryonic sac diameter for a gestational age of 30 days?

c. What is the average change in sac diameter associated with a 1-day increase in gestational age?

d. What is the average change in sac diameter associated with a 5-day increase in gestational age?

e. Would you use this model to predict the mean embryonic sac diameter for all gestation ages from conception to birth? Why or why not?


AP* Exam Tip: Inferences about the slope of the population regression line are based on the sampling distribution of the statistic b. The properties given here depend on the four basic assumptions of the linear regression model being met. In Section 16.3, you will see how to determine if these assumptions are reasonable.

AP* Exam Tip: For inferences about the slope of the population regression line, df = n − 2.

Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied:

1. The mean value of the sampling distribution of b is $\beta$. That is, $\mu_b = \beta$, so the sampling distribution of b is always centered at the value of $\beta$. This means that b is an unbiased statistic for estimating $\beta$.

2. The standard deviation of the sampling distribution of the statistic b is

$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$

3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

The fact that b is unbiased tells you that the sampling distribution is centered at the right place, but it gives no information about variability. If $\sigma_b$ is large, the sampling distribution of b will be quite spread out around $\beta$ and an estimate far from the value of $\beta$ could result. For

$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$

to be small, the numerator $\sigma_e$ should be small (little variability about the population line) and/or the denominator $\sqrt{\sum (x - \bar{x})^2}$ should be large. Because $\sum (x - \bar{x})^2$ is a measure of how much the observed x values spread out, $\beta$ tends to be more precisely estimated when the x values in the sample are spread out rather than when they are close together.

The normality of the sampling distribution of b implies that the standardized variable

$$z = \frac{b - \beta}{\sigma_b}$$

has a standard normal distribution. However, inferential methods cannot be based on this statistic, because the value of $\sigma_b$ is not known (the unknown $\sigma_e$ appears in the numerator of $\sigma_b$). One way to proceed is to estimate $\sigma_e$ with $s_e$ to obtain an estimate of $\sigma_b$.

The estimated standard deviation of the statistic b is

$$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$$

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

$$t = \frac{b - \beta}{s_b}$$

is the t distribution with df = n − 2.
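The properties of the sampling distribution of b can be checked informally by simulation. The sketch below is only an illustration: the population values (intercept 5.0, slope 1.5, error standard deviation 2.0) and both sets of x values are hypothetical and do not come from the text. It draws many samples from the model, computes the least squares slope for each, and compares the simulated mean and standard deviation of b with $\beta$ and with $\sigma_b = \sigma_e / \sqrt{\sum (x - \bar{x})^2}$.

```python
import math
import random
import statistics


def least_squares_slope(xs, ys):
    """Return the slope b of the least squares line for paired data."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def simulate_slopes(xs, intercept, beta, sigma_e, n_samples, seed=1):
    """Draw samples from y = intercept + beta*x + e and return each sample's slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        ys = [intercept + beta * x + rng.gauss(0.0, sigma_e) for x in xs]
        slopes.append(least_squares_slope(xs, ys))
    return slopes


def theoretical_sd_of_b(xs, sigma_e):
    """sigma_b = sigma_e / sqrt(sum of squared deviations of x from its mean)."""
    x_bar = statistics.fmean(xs)
    return sigma_e / math.sqrt(sum((x - x_bar) ** 2 for x in xs))


# Hypothetical population values, chosen only for illustration.
INTERCEPT, BETA, SIGMA_E = 5.0, 1.5, 2.0
spread_out_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bunched_x = [4.6, 4.8, 5.0, 5.2, 5.4, 5.6, 5.8, 6.0, 6.2, 6.4]

slopes = simulate_slopes(spread_out_x, INTERCEPT, BETA, SIGMA_E, n_samples=4000)
print("mean of simulated b:   ", round(statistics.fmean(slopes), 3))
print("sd of simulated b:     ", round(statistics.stdev(slopes), 3))
print("sigma_b, spread-out x: ", round(theoretical_sd_of_b(spread_out_x, SIGMA_E), 3))
print("sigma_b, bunched x:    ", round(theoretical_sd_of_b(bunched_x, SIGMA_E), 3))
```

With the same error standard deviation, the bunched x values give a much larger $\sigma_b$ than the spread-out ones, which is the point made above about precision and the spread of the x values.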

Section 16.2 Inferences Concerning the Slope of the Population Regression Line

The slope coefficient $\beta$ in the simple linear regression model represents the average or expected change in the response variable y that is associated with a 1-unit increase in the value of the independent variable x. For example, consider x = the size of a house (in square feet) and y = selling price of the house. If the simple linear regression model is appropriate for the population of houses in a particular city, $\beta$ would be the average increase in selling price associated with a 1-square-foot increase in size. As another example, if x = amount of time per week a computer system is used and y = the resulting annual maintenance expense, then $\beta$ would be the expected change in expense associated with using the computer system one additional hour per week.

Because the value of $\beta$ is almost always unknown, it must be estimated from sample data. The slope of the least squares regression line, b, provides an estimate. In some situations, the value of the statistic b may vary greatly from sample to sample, and the value of b computed from a single sample may be quite different from the value of the population slope, $\beta$. In other situations, almost all possible samples result in a value of b that is quite close to $\beta$. The sampling distribution of b provides information about the behavior of this statistic.


In the same way that

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

was used in Chapter 12 to develop a confidence interval for $\mu$, the t variable in the preceding box can be used to obtain a confidence interval for $\beta$.

Confidence Interval for $\beta$

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form

$$b \pm (t \text{ critical value})\, s_b$$

where the t critical value is based on df = n − 2. Appendix Table 3 gives critical values corresponding to the most frequently used confidence levels.
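As a quick illustration of the interval formula, the sketch below applies it to the slope estimate and standard error printed in the JMP output for the wolf data of Exercise 16.12 (b = 0.1903121, $s_b$ = 0.023597, 19 observations). The function name is illustrative, and the t critical value of 2.11 for 95% confidence and df = 17 is an assumption taken from a standard t table, since the text does not quote it.

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Return (lower, upper, margin_of_error) for b ± (t critical value) * s_b."""
    margin = t_critical * s_b
    return b - margin, b + margin, margin


# Slope estimate and standard error as printed in the JMP output for
# Exercise 16.12 (19 observations, so df = 19 - 2 = 17).
b = 0.1903121
s_b = 0.023597
# Assumption: 2.11 is the tabled t critical value for 95% confidence, df = 17.
t_critical = 2.11

lower, upper, margin = slope_confidence_interval(b, s_b, t_critical)
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
print(f"margin of error: {margin:.4f}")
```

The margin of error is simply the product (t critical value)$s_b$, so a smaller $s_b$ or a lower confidence level gives a narrower interval.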

The interval estimate of $\beta$ is centered at b and extends out from the center by an amount that depends on the sampling variability of b. When $s_b$ is small, the interval is narrow, implying that the investigator has relatively precise knowledge of the value of $\beta$. Calculation of a confidence interval for the slope of a population regression line is illustrated in Example 16.4.

In Section 7.2, you learned four key questions that guide the decision about what statistical inference method to consider in any particular situation. In Section 7.3, a five-step process for estimation problems was introduced.

The four key questions of Section 7.2 were

Q  Question type                      Estimation or hypothesis testing?
S  Study type                         Sample data or experiment data?
T  Type of data                       One variable or two? Categorical or numerical?
N  Number of samples or treatments    How many samples or treatments?

When the answers to these questions are

Q: estimation
S: sample data
T: two numerical variables
N: one sample

the method you will want to consider in a regression setting is the confidence interval for the slope of a population regression line.

Once you have selected the confidence interval for the slope of a population regression line as the method you want to consider, because this is an estimation problem you would follow the five-step process for estimation problems (EMC3).

Example 16.4 The Bison of Yellowstone Park

The dedicated work of conservationists for over 100 years has brought the bison in Yellowstone National Park from near extinction to a herd of over 3,000 animals. This recovery is a mixed blessing. Many bison have been exposed to the bacteria that cause brucellosis, a disease that infects domestic cattle, and there are many domestic cattle herds near Yellowstone. Because of concerns that free-ranging bison can infect nearby cattle, it is important to monitor and manage the size of the bison population and, if possible, keep bison from transmitting these bacteria to ranch cattle. The article “Reproduction and Survival of Yellowstone Bison” (The Journal of Wildlife Management [2007]: 2365–2372) described a large multiyear study of the factors that influence bison movement and herd size. The researchers studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence reproduction is stress due to accumulated snow, which makes foraging more difficult for the pregnant bison. Data from 1981–1997 on

y = spring calf ratio (SCR)

and

x = previous fall snow-water equivalent (SWE)

are shown in the accompanying table. Spring calf ratio is the ratio of calves to adults, a measure of reproductive success. The researchers were interested in estimating the mean change in spring calf ratio associated with each additional cm in snow-water equivalent.

SCR     SWE         SCR     SWE
0.19    1,933       0.22    3,317
0.14    4,906       0.22    3,332
0.21    3,072       0.18    3,511
0.23    2,543       0.21    3,907
0.26    3,509       0.25    2,533
0.19    3,908       0.19    4,611
0.29    2,214       0.22    6,237
0.23    2,816       0.17    7,279
0.16    4,128

Let’s answer the four key questions for this problem.

Q  Question type                      Estimation or hypothesis testing?                  Estimation
S  Study type                         Sample data or experiment data?                    Sample data
T  Type of data                       One variable or two? Categorical or numerical?     Two numerical variables
N  Number of samples or treatments    How many samples or treatments?                    One sample (regression)

The answers are estimation, sample data, two numerical variables, one sample. This combination of answers suggests considering a confidence interval for the slope of a population regression line. You can now use the five-step process (EMC3) to estimate the slope of the population regression line.

Step

Estimate: In this example, the value of $\beta$, the mean increase in spring calf ratio for each additional 1 cm of snow-water equivalent, will be estimated.

Method: Because the answers to the four key questions are estimation, sample data, two numerical variables, one sample, a confidence interval for $\beta$, the slope of the population regression line, will be considered. For this example, a 95% confidence level will be used.

Check: The four basic assumptions of the simple linear regression model need to be met in order to use the confidence interval.


The investigators collected data from 17 successive years. To proceed, you would need to assume that these years are representative of yearly circumstances at Yellowstone, and that each year’s reproduction and snowfall is independent of previous years. You should keep this in mind when you get to the step that involves interpretation.

A scatterplot of the data is shown here. The pattern in the plot looks linear and the spread does not seem to be different for different values of x.

[Scatterplot: SCR versus SWE]

A boxplot of the residuals is also shown.

[Boxplot of the residuals]

Because the boxplot is approximately symmetric and there are no outliers, it is reasonable to think that the distribution of e is approximately normal.

Calculate: JMP regression output is shown here:

Linear Fit
SCR = 0.2606561 − 0.0136639*SWE

Summary of Fit
RSquare                       0.257644
RSquare Adj                   0.208153
Root Mean Square Error        0.033513
Mean of Response              0.209412
Observations (or Sum Wgts)    17

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    0.2606561    0.023885    10.91     <.0001*
SWE          −0.013664    0.005989    −2.28     0.0375*

df = n − 2 = 17 − 2 = 15

The t critical value for a 95% confidence level and df = 15 is 2.13.

$$b \pm (t \text{ critical value})\, s_b = -0.0137 \pm (2.13)(0.00599) = (-0.0265, -0.0009)$$

Communicate Results

Confidence interval: You can be 95% confident that the true average change in spring calf ratio associated with an increase of 1 cm in the snow-water equivalent is between −0.0265 and −0.0009.

Confidence level: The method used to construct this interval estimate is successful in capturing the actual value of the slope of the population regression line about 95% of the time.

Hypothesis Tests Concerning $\beta$

Hypotheses about $\beta$ can be tested using a t test similar to the t tests introduced in Chapters 12 and 13. The null hypothesis states that $\beta$ has a specified hypothesized value. The t statistic results from standardizing b, the estimate of $\beta$, under the assumption that $H_0$ is true. When $H_0$ is true, the sampling distribution of this statistic is the t distribution with df = n − 2.

Hypothesis Test for the Slope of the Population Regression Line, $\beta$

Appropriate when the four basic assumptions of the simple linear regression model are reasonable:

1. The distribution of e at any particular x value has mean value 0 (that is, $\mu_e = 0$).

2. The standard deviation of e is $\sigma_e$, which does not depend on x.

3. The distribution of e at any particular x value is normal.

4. The random deviations $e_1, e_2, e_3, \ldots, e_n$ associated with different observations are independent of one another.

When these conditions are met, the following test statistic can be used:

$$t = \frac{b - \beta_0}{s_b}$$

where $\beta_0$ is the hypothesized value from the null hypothesis.

Form of the null hypothesis: $H_0: \beta = \beta_0$

When the assumptions of the simple linear regression model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n − 2.

Associated P-value:

When the alternative hypothesis is…    The P-value is…
$H_a: \beta > \beta_0$                 Area to the right of the computed t under the appropriate t curve
$H_a: \beta < \beta_0$                 Area to the left of the computed t under the appropriate t curve
$H_a: \beta \neq \beta_0$              2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

This test is a method you should consider when the answers to the four key questions are hypothesis testing, sample data, two numerical variables, one sample. You would carry out this test using the five-step process for hypothesis testing problems (HMC3).
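The test statistic and two-sided P-value can be reproduced from raw data with nothing more than the formulas above. The sketch below uses the bison data of Example 16.4 and should recover the SWE row of the JMP Parameter Estimates table (estimate −0.013664, standard error 0.005989, t ratio −2.28, Prob>|t| 0.0375). Two things here are assumptions rather than part of the text: the SWE values are entered divided by 1,000 (1,933 becomes 1.933), which is the scale on which the printed estimates are recovered, and the tail area is approximated by Simpson's rule applied to the t density instead of being read from a table. The function names are illustrative.

```python
import math

# Example 16.4 data. Assumption: SWE is entered divided by 1,000
# (1,933 -> 1.933), the scale that matches the printed JMP estimates.
swe = [1.933, 4.906, 3.072, 2.543, 3.509, 3.908, 2.214, 2.816, 4.128,
       3.317, 3.332, 3.511, 3.907, 2.533, 4.611, 6.237, 7.279]
scr = [0.19, 0.14, 0.21, 0.23, 0.26, 0.19, 0.29, 0.23, 0.16,
       0.22, 0.22, 0.18, 0.21, 0.25, 0.19, 0.22, 0.17]


def fit_slope(xs, ys):
    """Return (a, b, s_e, s_b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(ss_resid / (n - 2))
    s_b = s_e / math.sqrt(sxx)
    return a, b, s_e, s_b


def t_upper_tail(t, df, steps=2000):
    """Area to the right of t under the t curve (Simpson's rule; steps must be even)."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def density(u):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(u * u / df))

    upper = abs(t)
    h = upper / steps
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    tail = max(0.0, 0.5 - total * h / 3)  # area beyond |t| on one side
    return tail if t >= 0 else 1.0 - tail


def slope_t_test(b, s_b, n, beta_0=0.0):
    """Return (t, two-sided P-value) for H0: beta = beta_0 with df = n - 2."""
    t = (b - beta_0) / s_b
    return t, 2 * t_upper_tail(abs(t), n - 2)


a, b, s_e, s_b = fit_slope(swe, scr)
t, p_value = slope_t_test(b, s_b, len(swe))
print(f"b = {b:.6f}, s_e = {s_e:.6f}, s_b = {s_b:.6f}")
print(f"t = {t:.2f}, two-sided P-value = {p_value:.4f}")
```

For a one-sided alternative, `t_upper_tail(t, df)` gives the area to the right of the computed t, and one minus that value gives the area to the left.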

Inference for a population slope generally focuses on two questions:

(1) Is the population slope different from zero? (2) What are plausible values for the population slope?

The question of plausible values can be addressed by calculating a confidence interval for the population slope. The question of whether a population slope is equal to zero can be answered by using the hypothesis testing procedure with a null hypothesis $H_0: \beta = 0$. This test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$ is called the model utility test for simple linear regression.

The default computer output for inference for a regression slope is for the model utility test.

When the null hypothesis of the model utility test is true, the population regression line is a horizontal line, and the value of y in the simple linear regression model does not depend on x. That is,

$$y = \alpha + \beta x + e = \alpha + 0x + e = \alpha + e$$

If $\beta$ is in fact equal to 0, knowledge of x will be of no use; it will have no “utility” for predicting y. On the other hand, if $\beta$ is different from 0, there is a useful linear relationship between x and y, and knowledge of x is useful for predicting y. This is illustrated by the scatterplots in Figure 16.10.

FIGURE 16.10 (a) $\beta = 0$; (b) $\beta \neq 0$

[Two scatterplots of y versus x: (a) slope = 0; (b) nonzero slope]

The Model Utility Test for Simple Linear Regression

The model utility test for simple linear regression is the test of

$$H_0: \beta = 0$$

versus

$$H_a: \beta \neq 0$$

The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y. If $H_0$ is rejected, you can conclude that the simple linear regression model is useful for predicting y.

The test statistic is the t ratio

$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$

It is recommended that the model utility test be carried out before using the estimated regression line to make inferences.
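One way to see what the model utility test does is to simulate it. The sketch below is an illustration under stated assumptions, not part of the text: the intercept (10.0), error standard deviation (3.0), and the 17 x values are hypothetical, and the critical value 2.13 is the df = 15 value quoted in Example 16.4. When $\beta = 0$, the test should reject in roughly 5% of samples; when $\beta$ is clearly different from 0, it should reject far more often.

```python
import math
import random


def t_ratio(xs, ys):
    """Return the model utility test statistic t = b / s_b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_b = math.sqrt(ss_resid / (n - 2)) / math.sqrt(sxx)
    return b / s_b


def rejection_rate(beta, n_samples=2000, seed=7):
    """Share of simulated samples with |t| above the 5% two-sided critical value."""
    rng = random.Random(seed)
    xs = list(range(1, 18))        # 17 hypothetical x values, so df = 15
    t_critical = 2.13              # df = 15, as quoted in Example 16.4
    intercept, sigma_e = 10.0, 3.0 # hypothetical population values
    rejections = 0
    for _ in range(n_samples):
        ys = [intercept + beta * x + rng.gauss(0.0, sigma_e) for x in xs]
        if abs(t_ratio(xs, ys)) > t_critical:
            rejections += 1
    return rejections / n_samples


print("rejection rate when beta = 0:  ", rejection_rate(0.0))
print("rejection rate when beta = 0.5:", rejection_rate(0.5))
```

The first rate estimates the chance of wrongly declaring the model useful at the 0.05 level; the second estimates the chance of detecting a real linear relationship for these particular hypothetical settings.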

Example 16.5 The British (Musical) Invasion

Have you experienced a sudden flood of memory when scanning from station to station on your car radio and recognized a song from your past? Perhaps you could remember the title of the song, the artist, and even when the song was released. From a seemingly small amount of information you were able to recover a great deal of the song’s context from memory. The article “Plink: ‘Thin Slices’ of Music” (Krumhansl, C. Music Perception [2010]: 337–354) describes a study of this phenomenon. The investigator compiled a list of songs from Rolling Stone, Billboard, and Blender lists of songs plus some recent songs familiar to college students. Twenty-three college students were then exposed to 56 clips of songs. Most of these students had had musical training, and they listened to popular music for an average of 21.7 hours per week. After hearing three short clips from a song (only 400 ms in duration), the students were asked in what year each of the songs was released. The accompanying table shows the actual release year and the average of the release years given by the students. The actual release years ranged from 1965 (The Beatles, “Help”) to 2008 (Katy Perry, “I Kissed a Girl”).

Is there a relationship between the judged and actual release year for these songs? A scatterplot of the data (Figure 16.11) suggests that there is a linear relation between these two variables, but this can be confirmed using the model utility test.

With x = actual release year and y = judged release year, the equation of the estimated regression line is $\hat{y} = 1095 + 0.449x$. The five-step process for hypothesis testing can be used to carry out the model utility test.

Actual and Judged Release Years

Actual   Judged     Actual   Judged     Actual   Judged     Actual   Judged
Release  Release    Release  Release    Release  Release    Release  Release
1998     1997.2     1976     1983.3     1976     1988.0     1970     1985.4
1967     1973.7     2008     1995.0     2006     1996.7     1975     1985.9
1998     1996.3     1971     1979.8     1974     1985.4     1991     1993.3
1999     1993.3     1965     1976.8     2007     1999.8     2008     1995.4
1983     1985.4     1967     1975.0     1976     1987.2     1965     1977.6
1982     1988.0     1971     1978.0     1974     1977.6     1987     1990.7
1965     1970.2     1967     1978.0     1970     1982.8     1975     1986.3
1991     1992.8     1984     1983.3     1971     1976.3     1968     1986.7
1983     1984.1     1984     1989.8     1999     1988.5     1987     1988.0
1976     1979.3     1968     1976.7     1997     1994.1     2008     1990.2
1971     1975.4     1965     1978.5     2006     1995.4     1982     1991.1
1981     1984.6     1965     1977.2     1981     1989.3     1979     1983.7
1967     1973.7     1979     1986.7     2008     1993.7     2000     1989.8
2007     1997.2     1997     1996.3     1965     1981.1     2000     1991.1


FIGURE 16.11 Scatterplot of judged release year versus actual release year

[Scatterplot: Judged versus Actual release year]

Process step

H Hypotheses: In the model utility test, the null hypothesis is that there is no useful relationship between the actual and the judged release year: $H_0: \beta = 0$. The alternative hypothesis specifies that there is a useful relationship: $\beta \neq 0$.

Hypotheses:

Null hypothesis: $H_0: \beta = 0$

Alternative hypothesis: $H_a: \beta \neq 0$

M Method: Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting, and one sample, a hypothesis test for the slope of a population regression line will be considered. The test statistic for this test is

$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$

The value of 0 in the test statistic is the hypothesized value from the null hypothesis. For this example, a significance level of 0.05 will be used.

Significance level: $\alpha = 0.05$

C Check: In Section 16.3, you will see how to check to see if the four assumptions of the simple linear regression model are reasonable. For this example, you can assume that these assumptions are reasonable and proceed with the model utility test.

C Calculate: JMP output is shown here:

Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release

Summary of Fit
RSquare                       0.771
RSquare Adj                   0.766759
Root Mean Square Error        3.59844
Mean of Response              1986.013
Observations (or Sum Wgts)    56

Parameter Estimates
Term              Estimate     Std Error   t Ratio   Prob>|t|
Intercept         1095.1525    66.07159    16.58     <.0001*
Actual Release    0.449281     0.033321    13.48     <.0001*

Test statistic:

$$t = \frac{b - 0}{s_b} = \frac{0.449 - 0}{0.0333} = 13.48$$

Associated P-value:

P-value = twice the area under the t curve to the right of 13.48 = 2P(t > 13.48) ≈ 0

C Communicate results: Because the P-value is less than the selected significance level, the null hypothesis is rejected.

Decision: Reject $H_0$.

Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the actual release year and the judged release year.

Because the model utility test confirms that there is a useful linear relationship between judged release year and actual release year, it would be reasonable to use the estimated regression model to predict the judged release year for a given song based on its actual release year. Of course, before you do this, you would also want to evaluate the accuracy of predictions by looking at the value of $s_e$.

When $H_0: \beta = 0$ cannot be rejected using the model utility test at a reasonably small significance level, the search for a useful model must continue. One possibility is to relate y to x using a nonlinear model, an appropriate strategy if the scatterplot shows curvature.

Section 16.2 Exercise Set 1

Each exercise set assesses the following chapter learning objectives: C2, M3, M4, M5, M6, P1, P2

16.13 The standard deviation of the errors, $\sigma_e$, is an important part of the linear regression model.

a. What is the relationship between the value of $\sigma_e$ and the value of the test statistic in a test of a hypothesis about $\beta$?

b. What is the relationship between the value of $\sigma_e$ and the width of a confidence interval for $\beta$?

16.14 A journalist is reporting about some research on appropriate amounts of sleep for people 9 to 19 years of age. In that research, a linear regression model is used to describe the relationship between alertness and number of hours of sleep the night before. The researchers reported a 95% confidence interval, but newspapers usually report an estimate and a margin of error.

a. In order to calculate a margin of error from the reported confidence interval, what additional conditions, if any, need to be verified?

b. In order to calculate a margin of error from the reported confidence interval, what additional information, if any, is needed?

16.15 A nursing student has completed his final project, and is preparing for a meeting with his project advisor. The subject of his project was the relationship between systolic blood pressure (SBP) and body mass index (BMI). The last time he met with his advisor he had completed his measurements, but only entered half his data into his statistical software. For the data he had entered, the necessary conditions for inference for $\beta$ were met. In a short paragraph, explain, using appropriate statistical terminology, which of the conditions below must be rechecked.

1. The standard deviation of e is the same for all values of x.

2. The distribution of e at any particular x value is normal.

16.16 Consider the accompanying data on x = research and development expenditure (thousands of dollars) and y = growth rate (% per year) for eight different industries.

x   2024   5038   905    3572   1157   327     378    191
y   1.90   3.96   2.44   0.88   0.37   −0.90   0.49   1.01

a. Would a simple linear regression model provide useful information for predicting growth rate from research and development expenditure? Use a 0.05 level of significance.

b. Use a 90% confidence interval to estimate the average change in growth rate associated with a $1000 increase in expenditure. Interpret the resulting interval.

16.17 The paper “The Effects of Split Keyboard Geometry on Upper Body Postures” (Ergonomics [2009]: 104–111) describes a study to determine the effects of several keyboard characteristics on typing speed. One of the variables considered was the front-to-back surface angle of the keyboard. Minitab output resulting from fitting the simple linear regression model with x = surface angle (degrees) and y = typing speed (words per minute) is given below.

Regression Analysis: Typing Speed versus Surface Angle

The regression equation is
Typing Speed = 60.0 + 0.0036 Surface Angle

Predictor        Coef       SE Coef    T         P
Constant         60.0286    0.2466     243.45    0.000
Surface Angle    0.00357    0.03823    0.09      0.931

S = 0.511766   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Source            DF    SS        MS        F       P
Regression        1     0.0023    0.0023    0.01    0.931
Residual Error    3     0.7857    0.2619
Total             4     0.7880

a. Suppose that the basic assumptions of the simple linear regression model are met. Carry out a hypothesis test to decide if there is a useful linear relationship between x and y.

b. Are the values of $s_e$ and $r^2$ consistent with the conclusion from Part (a)? Explain.

16.18 Do taller adults make more money? The authors of the paper “Stature and Status: Height, Ability, and Labor Market Outcomes” (Journal of Political Economics [2008]: 499–532) investigated the association between height and earnings. They used the simple linear regression model to describe the relationship between x = height (in inches) and y = log(weekly gross earnings in dollars) in a very large sample of men. The logarithm of weekly gross earnings was used because this transformation resulted in a relationship that was approximately linear. The paper reported that the slope of the estimated regression line was b = 0.023 and the standard deviation of b was $s_b = 0.004$. Carry out a hypothesis test to decide if there is convincing evidence of a useful linear relationship between height and the logarithm of weekly earnings. You can assume that the basic assumptions of the simple linear regression model are met.

16.19 The effects of grazing animals on grasslands have been the focus of numerous investigations by ecologists. One such study, reported in “The Ecology of Plants, Large Mammalian Herbivores, and Drought in Yellowstone National Park” (Ecology [1992]: 2043–2058), proposed using the simple linear regression model to relate y = green biomass concentration (g/cm³) to x = elapsed time since snowmelt (days).

a. The estimated regression equation was given as $\hat{y} = 106.3 - 0.640x$. What is the estimate of average change in biomass concentration associated with a 1-day increase in elapsed time?

b. What value of biomass concentration would you predict when elapsed time is 40 days?

c. The sample size was n = 58, and the reported value of the coefficient of determination was 0.470. What does this tell you about the linear relationship between the two variables?


16.20 Consider a test of hypotheses about $\beta$, the population slope in a linear regression model.

a. If you reject the null hypothesis, $\beta = 0$, what does this mean in terms of a linear relationship between x and y?

b. If you fail to reject the null hypothesis, $\beta = 0$, what does this mean in terms of a linear relationship between x and y?

16.21 Researchers studying pleasant touch sensations measured the firing frequency (impulses per second) of nerves that were stimulated by a light brushing stroke on the forearm and also recorded the subject’s numerical rating of how pleasant the sensation was. The accompanying data were read from a graph in the paper “Coding of Pleasant Touch by Unmyelinated Afferents in Humans” (Nature Neuroscience, April 12, 2009).

Firing Frequency    Pleasantness Rating
23                  0.2
24                  1.0
22                  1.2
25                  1.2
27                  1.0
28                  2.0
34                  2.3
33                  2.2
36                  2.4
34                  2.8

a. Estimate the mean change in pleasantness rating associated with an increase of 1 impulse per second in firing frequency using a 95% confidence interval. Interpret the resulting interval.

b. Carry out a hypothesis test to decide if there is convincing evidence of a useful linear relationship between firing frequency and pleasantness rating.

16.22 The largest commercial fishing enterprise in the southeastern United States is the harvest of shrimp. In a study described in the paper “Long-Term Trawl Monitoring of White Shrimp, Litopenaeus setiferus (Linnaeus), Stocks within the ACE Basin National Estuarine Research Reserve, South Carolina” (Journal of Coastal Research [2008]: 193–199), researchers monitored variables thought to be related to the abundance of white shrimp. One variable the researchers thought might be related to abundance is the amount of oxygen in the water. The relationship between mean catch per tow of white shrimp and oxygen concentration was described by fitting a regression line using data from ten randomly selected offshore sites. (The “catch” per tow is the number of shrimp caught in a single outing.) Computer output is shown below.

The regression equation is
Mean catch per tow = -5859 + 97.2 O2 Saturation

Predictor       Coef     SE Coef   T       P
Constant        -5859    2394      -2.45   0.040
O2 Saturation   97.22    34.63     2.81    0.023

S = 481.632   R-Sq = 49.6%   R-Sq(adj) = 43.3%

a. Is there convincing evidence of a useful linear relationship between the shrimp catch per tow and oxygen concentration density? Explain.



b. Would you describe the relationship as strong? Why or why not?

c. Construct a 95% confidence interval for $\beta$ and interpret it in context.

d. What margin of error is associated with the confidence interval in Part (c)?

16.23 The authors of the paper “Decreased Brain Volume in Adults with Childhood Lead Exposure” (Public Library of Science Medicine [May 27, 2008]: e112) studied the relationship between childhood environmental lead exposure and a measure of brain volume change in a particular region of the brain. Data were given for x = mean childhood blood lead level (μg/dL) and y = brain volume change (BVC, in percent). A subset of data read from a graph that appeared in the paper was used to produce the accompanying Minitab output.

Regression Analysis: BVC versus Mean Blood Lead Level

The regression equation is
BVC = -0.00179 - 0.00210 Mean Blood Lead Level

Predictor               Coef         SE Coef     T       P
Constant                -0.001790    0.008303    -0.22   0.830
Mean Blood Lead Level   -0.0021007   0.0005743   -3.66   0.000

Carry out a hypothesis test to decide if there is convincing evidence of a useful linear relationship between x and y. You can assume that the basic assumptions of the simple linear regression model are met.

Additional Exercises

16.24 a. Explain the difference between the line $y = \alpha + \beta x$ and the line $\hat{y} = a + bx$.

b. Explain the difference between $\beta$ and $b$.

c. Let x* denote a particular value of the independent variable. Explain the difference between $\alpha + \beta x^*$ and $a + bx^*$.

d. Explain the difference between $\sigma$ and $s_e$.

16.25 What is the distinction between $\sigma_e$ and $s_e$?

16.26 The accompanying data were read from a plot (and are a subset of the complete data set) given in the article “Cognitive Slowing in Closed-Head Injury” (Brain and Cognition [1996]: 429–440). The data represent the mean response times for a group of individuals with closed-head injury (CHI) and a matched control group without head injury on 10 different tasks. Each observation was based on a different study, and used different subjects, so it is reasonable to assume that the observations are independent.

         Mean Response Time
Study    Control    CHI
1        250        303
2        360        491
3        475        659
4        525        683
5        610        922
6        740        1044
7        880        1421
8        920        1329
9        1010       1481
10       1200       1815

a. Fit a linear regression model that would allow you to predict the mean response time for those suffering a closed-head injury from the mean response time on the same task for individuals with no head injury.

b. Do the sample data support the hypothesis that there is a useful linear relationship between the mean response time for individuals with no head injury and the mean response time for individuals with CHI? Test the appropriate hypotheses using $\alpha = 0.05$.

16.27 The article “Photocharge Effects in Dye Sensitized Ag[Br,I] Emulsions at Millisecond Range Exposures” (Photographic Science and Engineering [1981]: 138–144) gave the accompanying data on x = % light absorption and y = peak photovoltage.

x   4.0    8.7    12.7   19.1   21.4   24.6   28.9   29.8   30.5
y   0.12   0.28   0.55   0.68   0.85   1.02   1.15   1.34   1.29

JMP output for these data is shown below.

Bivariate Fit of PeakPhotoVoltage By %LightAbsorption

[Scatterplot with fitted line: PeakPhotoVoltage (vertical axis, 0 to 1.4) versus %LightAbsorption (horizontal axis, 0 to 35)]

Linear Fit
PeakPhotoVoltage = -0.082594 + 0.0446485*%LightAbsorption

Summary of Fit
RSquare                  0.982731
RSquare Adj              0.980264
Root Mean Square Error   0.061117
Mean of Response         0.808889
Observations             9

Parameter Estimates
Term               Estimate     Std Error   t Ratio   Prob>|t|
Intercept          -0.082594    0.049093    -1.68     0.1364
%LightAbsorption   0.0446485    0.002237    19.96     <.0001*

a. What does the scatterplot suggest about the relationship between the peak photovoltage and the percent of light absorption?

b. What is the equation of the estimated regression line?

c. How much of the observed variation in peak photovoltage can be explained by the model relationship?

d. Predict peak photovoltage when percent absorption is 19.1, and compute the value of the corresponding residual.

e. The authors claimed that there is a useful linear relationship between the two variables. Do you agree? Carry out a formal test.

f. Give an estimate of the average change in peak photovoltage associated with a 1 percentage point increase in light absorption. Your estimate should convey information about the precision of estimation.

16.28 In anthropological studies, an important characteristic of fossils is cranial capacity. Frequently skulls are at least partially decomposed, so it is necessary to use other characteristics to obtain information about capacity. One measure that has been used is the length of the lambda-opisthion chord. The article “Vertesszollos and the Presapiens Theory” (American Journal of Physical Anthropology [1971]) reported the accompanying data for n = 7 Homo erectus fossils.

x (chord length in mm)   78    75    78    81    84    86     87
y (capacity in cm³)      850   775   750   975   915   1015   1030

Suppose that from previous evidence, anthropologists had believed that for each 1-mm increase in chord length, cranial capacity would be expected to increase by 20 cm³. Do these new experimental data provide convincing evidence against this prior belief?

16.29 Suppose you are given the computer output shown. You are interested in testing the null hypothesis $\beta = 1.0$ versus an alternative hypothesis of $\beta > 1.0$. Describe how you would use the given computer output to test these hypotheses.

Linear Fit
y = 5.6452776 + 0.9797401*x

Summary of Fit
RSquare                  0.985289
RSquare Adj              0.984954
Root Mean Square Error   12.48525
Mean of Response         0.791304
Observations             46

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   5.6452776    1.84302     3.06      0.0037*
x           0.9797401    0.018048    54.29     <.0001*

Section 16.3 Checking Model Adequacy

Section 16.2 introduced methods for estimating and testing hypotheses about $\beta$, the slope in the simple linear regression model

$$y = \alpha + \beta x + e$$

In this model, $e$ represents the random deviation of a y value from the population regression line $\alpha + \beta x$. The methods presented in Section 16.2 require that some assumptions about the random deviations in the simple linear regression model be met in order for inferences to be valid. These assumptions include:

1. At any particular x value, the distribution of $e$ is normal.
2. At any particular x value, the standard deviation of $e$ is $\sigma_e$, which is constant over all values of x (that is, $\sigma_e$ does not depend on x).

Inferences based on the simple linear regression model are still appropriate if model assumptions are slightly violated (for example, mild skew in the distribution of $e$). However, interpreting a confidence interval or the result of a hypothesis test when assumptions are seriously violated can result in misleading conclusions. For this reason, it is important to be able to detect any serious violations.

Residual Analysis

If the deviations $e_1, e_2, \ldots, e_n$ from the population line were available, they could be examined for any inconsistencies with model assumptions. For example, a normal probability plot of these deviations would suggest whether or not the normality assumption was plausible. However, because these deviations are

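$$
\begin{aligned}
e_1 &= y_1 - (\alpha + \beta x_1) \\
&\;\;\vdots \\
e_n &= y_n - (\alpha + \beta x_n)
\end{aligned}
$$

they can be calculated only if $\alpha$ and $\beta$ are known. In practice, this will almost never be the case. Instead, diagnostic checks must be based on the residuals

$$
\begin{aligned}
y_1 - \hat{y}_1 &= y_1 - (a + b x_1) \\
&\;\;\vdots \\
y_n - \hat{y}_n &= y_n - (a + b x_n)
\end{aligned}
$$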

which are the deviations from the estimated regression line. When all model assumptions are met, the mean value of the residuals at any particular x value is 0. Any observation that gives a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition. Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.

Recall that a quantity is standardized by subtracting its mean value (0 in this case) and dividing by its actual or estimated standard deviation:

$$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$$

The value of a standardized residual tells you the distance (in standard deviations) of the corresponding residual from its expected value, 0.

Because residuals at different x values have different standard deviations (depending on the value of x for that observation; see footnote 1), computing the standardized residuals can be tedious. Fortunately, many computer regression programs provide standardized residuals.
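The distinction between the unobservable deviations from the population line and the computable residuals can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population values `ALPHA`, `BETA`, and `SIGMA_E`, the x values, and the helper `least_squares` are hypothetical and are not taken from any data set in this chapter.

```python
import random


def least_squares(x, y):
    """Return (a, b): intercept and slope of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical population values, chosen only for illustration.
ALPHA, BETA, SIGMA_E = 10.0, 2.0, 3.0

rng = random.Random(1)
x = [float(v) for v in range(1, 21)]

# Deviations e_i from the population line: known here only because we simulate.
deviations = [rng.gauss(0.0, SIGMA_E) for _ in x]
y = [ALPHA + BETA * xi + ei for xi, ei in zip(x, deviations)]

# Residuals y_i - (a + b*x_i): always computable from the sample.
a, b = least_squares(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"sum of residuals  = {sum(residuals):.2e}")
print(f"sum of deviations = {sum(deviations):.3f}")
```

The residuals from a least squares fit always sum to 0 (up to rounding), whereas the simulated deviations generally do not, which is one way to see that the residuals only approximate the deviations.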

Example 16.6 Revisiting the Elk

Example 16.3 introduced data on

x = chest girth (in cm)

and

y = weight (in kg)

for a sample of 19 Rocky Mountain elk. (See Example 16.3 for a more detailed description of the study.)

Inspection of the scatterplot in Figure 16.12 suggests the data are consistent with the assumptions of the simple linear regression model.

FIGURE 16.12 Scatterplot for the elk data (vertical axis: Weight (kg); horizontal axis: Girth (cm))

Footnote 1. The estimated standard deviation of the $i$th residual, $y_i - \hat{y}_i$, is

$$s_e \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum (x - \bar{x})^2}}$$

TABLE 16.1 Data, Residuals, and Standardized Residuals for the Elk Data

Observation   Girth (cm) x   Weight (kg) y   Residual    Standardized Residual   ŷ
1             96             98              -38.2661    -1.92313                136.266
2             105            196             34.9314     1.68004                 161.069
3             108            163             -6.3361     -0.30135                169.336
4             109            196             23.9080     1.13323                 172.092
5             110            183             8.1522      0.38517                 174.848
6             114            171             -14.8711    -0.69477                185.871
7             121            230             24.8380     1.14452                 205.162
8             124            225             11.5705     0.53117                 213.429
9             131            211             -21.7203    -0.99323                232.720
10            135            231             -12.7436    -0.58320                243.744
11            137            225             -24.2553    -1.11135                249.255
12            138            266             13.9889     0.64147                 252.011
13            140            241             -16.5228    -0.75921                257.523
14            142            264             0.9655      0.04448                 263.034
15            157            284             -20.3720    -0.97540                304.372
16            157            292             -12.3720    -0.59236                304.372
17            159            300             -9.8837     -0.47699                309.884
18            155            337             38.1397     1.81294                 298.860
19            162            339             20.8488     1.01967                 318.151

The data, residuals, and standardized residuals (computed using Minitab) are given in Table 16.1. For the positive residual with the largest magnitude, 38.1397, the standardized residual is 1.81294. That is, the residual is approximately 1.8 standard deviations above its expected value of 0. This value is not particularly unusual in a sample of this size. Also notice that for the negative residual with the largest magnitude, −38.2661, the standardized residual is −1.92313, still not unusual in a sample of this size. On the standardized scale, no residual here is surprisingly large.
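If a package that reports standardized residuals is not available, they can be computed directly from the definition. The following minimal sketch applies the formula in footnote 1 to the elk data of Table 16.1; the variable names and the helper `residual_sd` are illustrative choices, not part of the original analysis.

```python
import math

# Elk data from Table 16.1: chest girth (cm) and weight (kg).
girth = [96, 105, 108, 109, 110, 114, 121, 124, 131, 135,
         137, 138, 140, 142, 157, 157, 159, 155, 162]
weight = [98, 196, 163, 196, 183, 171, 230, 225, 211, 231,
          225, 266, 241, 264, 284, 292, 300, 337, 339]

n = len(girth)
x_bar = sum(girth) / n
y_bar = sum(weight) / n
s_xx = sum((x - x_bar) ** 2 for x in girth)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(girth, weight))

# Least squares slope and intercept.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residuals and the standard deviation about the least squares line.
residuals = [y - (a + b * x) for x, y in zip(girth, weight)]
s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))


def residual_sd(x):
    """Estimated standard deviation of the residual at girth x (footnote 1)."""
    return s_e * math.sqrt(1 - 1 / n - (x - x_bar) ** 2 / s_xx)


standardized = [r / residual_sd(x) for x, r in zip(girth, residuals)]

for i, (r, z) in enumerate(zip(residuals, standardized), start=1):
    print(f"{i:2d}  residual = {r:9.4f}  standardized = {z:8.5f}")
```

The printed values should agree with the Residual and Standardized Residual columns of Table 16.1 up to rounding.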

FIGURE 16.13 Boxplots of residuals and standardized residuals for the elk data.

Next, consider the assumption of the normality of e’s. Figure 16.13 shows boxplots of the residuals and standardized residuals. The boxplots are approximately symmetric and there are no outliers, so the assumption of normally distributed errors seems reasonable.

Notice that the boxplots of the residuals and standardized residuals are nearly identical. While it is preferable to work with the standardized residuals, if you do not have access to a computer package or calculator that will produce standardized residuals, a plot of the unstandardized residuals should suffice.

A normal probability plot of the standardized residuals (or the residuals) is another way to assess whether it is reasonable to assume that $e_1, e_2, \ldots, e_n$ all come from the same normal distribution. An advantage of the normal probability plot, shown in Figure 16.14, is that the value of each residual can be seen, which provides more information about the distribution. The patterns in the normal probability plots of the standardized residuals and of the residuals for the elk data are reasonably straight, indicating that the assumption of normality of the error distribution is reasonable. Also notice that the patterns in the two normal probability plots are similar, so you don’t need to construct both; either plot could be used.
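The straightness of a normal probability plot can also be summarized numerically. The sketch below pairs the ordered standardized residuals from Table 16.1 with normal scores and computes their correlation. The normal scores use Blom's approximation, which is an assumption made here for illustration; the text does not say which formula produced Figure 16.14, and statistical packages differ slightly. The helper `correlation` is likewise only illustrative.

```python
import math
from statistics import NormalDist

# Standardized residuals for the elk data (Table 16.1).
std_resid = [-1.92313, 1.68004, -0.30135, 1.13323, 0.38517, -0.69477,
             1.14452, 0.53117, -0.99323, -0.58320, -1.11135, 0.64147,
             -0.75921, 0.04448, -0.97540, -0.59236, -0.47699, 1.81294,
             1.01967]

n = len(std_resid)
ordered = sorted(std_resid)

# Assumed convention: Blom's approximation to the expected normal order statistics.
normal_scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    s_uv = sum((a - u_bar) * (c - v_bar) for a, c in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((c - v_bar) ** 2 for c in v)
    return s_uv / math.sqrt(s_uu * s_vv)


r = correlation(normal_scores, ordered)
print(f"correlation between normal scores and ordered residuals: {r:.3f}")
```

A correlation close to 1 corresponds to a nearly straight pattern in the normal probability plot. The plot itself remains the more informative tool, because it shows where any departures from a straight line occur.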

FIGURE 16.14 Normal probability plots of residuals and standardized residuals for the elk data (vertical axes: normal score)

AP* EXAM TIP
When considering linear regression, your first step should be to study the scatterplot and a residual plot. These two plots provide important information about whether a linear model is appropriate.

Plotting the Residuals

A plot of the (x, residual) pairs is called a residual plot, and a plot of the (x, standardized residual) pairs is a standardized residual plot. Residual and standardized residual plots typically exhibit the same general shapes. If you are using a computer package or graphing calculator that calculates standardized residuals, the standardized residual plot is recommended. If not, it is acceptable to use the unstandardized residual plot instead.

A standardized residual plot or a residual plot is often helpful in identifying unusual or highly influential observations and in checking for violations of model assumptions. A desirable plot is one that exhibits no particular pattern (such as curvature or a much greater spread in one part of the plot than in another) and that has no point that is far removed from all the others. A point in the residual plot falling far above or far below the horizontal line at height 0 corresponds to a large residual, which can indicate unusual behavior, such as a recording error, a nonstandard experimental condition, or an atypical experimental subject. A point with an x value that differs greatly from others in the data set could have exerted excessive influence in determining the estimated regression line.

A standardized residual plot such as the one pictured in Figure 16.15(a) is desirable, because no point lies much outside the horizontal band between −2 and 2 (so there is no unusually large residual corresponding to an outlying observation). There is no point far to the left or right of the others (which could indicate an observation that might greatly influence the estimated line), and there is no pattern to indicate that the model should somehow be modified. When the plot has the appearance of Figure 16.15(b), the fitted model should be changed to incorporate curvature (a nonlinear model).

The increasing spread from left to right in Figure 16.15(c) suggests that the variance of y is not the same at each x value but rather increases with x. A straight-line model may still be appropriate, but the best-fit line should be obtained by using weighted least squares rather than ordinary least squares. This involves giving more weight to observations in the region exhibiting low variability and less weight to observations in the region exhibiting high variability. A specialized regression analysis textbook or a statistician should be consulted for more information on using weighted least squares.
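To give a sense of what weighted least squares involves, the following is a minimal sketch rather than a full treatment. The data and the weights are hypothetical: the weights assume that the standard deviation of y is proportional to x, so each observation receives weight $1/x^2$. In practice the appropriate weights depend on how the variance actually changes with x, which is why consulting a specialized reference is recommended.

```python
def weighted_least_squares(x, y, w):
    """Return (a, b) minimizing the weighted sum of w_i * (y_i - a - b*x_i)**2."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data whose spread about the line grows with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.3, 7.4, 11.0, 10.9]

# Assumed weights: variance proportional to x**2, so weight = 1 / x**2.
w = [1.0 / xi ** 2 for xi in x]

# Equal weights reproduce ordinary least squares.
a_ols, b_ols = weighted_least_squares(x, y, [1.0] * len(x))
a_wls, b_wls = weighted_least_squares(x, y, w)

print(f"ordinary least squares: a = {a_ols:.3f}, b = {b_ols:.3f}")
print(f"weighted least squares: a = {a_wls:.3f}, b = {b_wls:.3f}")
```

With these weights, the observations at small x (where variability is assumed to be low) have more influence on the fitted line than those at large x.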

The standardized residual plots of Figures 16.15(d) and 16.15(e) show an outlier (a point with a large standardized residual) and a potentially influential observation, respectively. Consider deleting the observation corresponding to such a point from the data set and refitting a line. Substantial changes in estimates and various other quantities are a signal that a more careful analysis should be carried out before proceeding.


FIGURE 16.15 Examples of residual plots (each panel plots the standardized residual, on a scale from −2 to 2, against x): (a) satisfactory plot; (b) plot suggesting that a curvilinear regression model is needed; (c) plot indicating nonconstant variance; (d) plot showing a large residual; (e) plot showing a potentially influential observation.

Example 16.7 Snow Cover and Temperature

The article “Snow Cover and Temperature Relationships in North America and Eurasia” (Journal of Climate and Applied Meteorology [1983]: 460–469) explored the relationship between October–November continental snow cover (x, in millions of square kilometers) and December–February temperature (y, in °C). The following data refer to Eurasia during the n = 13 time periods (1969–1970, 1970–1971, …, 1981–1982):

x        y        Standardized Residual
13.00    -13.5    -0.11
12.75    -15.7    -2.19
16.70    -15.5    -0.36
18.85    -14.7    1.23
16.60    -16.1    -0.91
15.35    -14.6    -0.12
13.90    -13.4    0.34
22.40    -18.9    -1.54
16.20    -14.8    0.04
16.70    -13.6    1.25
13.65    -14.0    -0.28
13.90    -12.0    1.54
14.75    -13.5    0.58

AP* EXAM TIP
Notice the difference between an outlier (an observation that is far removed from the other observations in the y direction) and a potentially influential observation (an observation that is far removed from the other observations in the x direction).

A simple linear regression analysis described in the article included $r^2 = 0.52$ and $r = 0.72$, suggesting a significant linear relationship. This is confirmed by a model utility test. The scatterplot and standardized residual plot are displayed in Figure 16.16. There are no unusual patterns, although one standardized residual, −2.19, is a bit on the large side. The most interesting feature is the observation (22.40, −18.9), corresponding to a point far to the right of the others in these plots. This observation may have had a substantial influence on the estimated regression line. The estimated slope when all 13 observations are included is $b = -0.459$, and $s_b = 0.133$. When the potentially influential observation is deleted, the estimate of $\beta$ based on the remaining 12 observations is $b = -0.228$. The change in slope is

$$
\begin{aligned}
\text{change in slope} &= \text{original } b - \text{new } b \\
&= -0.459 - (-0.228) \\
&= -0.231
\end{aligned}
$$

The change expressed in standard deviations is $-0.231/0.133 = -1.74$. Because $b$ has changed by substantially more than 1 standard deviation, the observation under consideration appears to be highly influential.


In addition, $r^2$ based on just the 12 observations is only 0.13, and the t ratio for testing $\beta = 0$ is not significant. Evidence for a linear relationship is much less conclusive in light of this analysis. The investigators should seek a climatological explanation for the influential observation and collect more data, which could be used to find a more useful model.

FIGURE 16.16 Plots for the data of Example 16.7: (a) scatterplot of TEMP versus SNOW; (b) standardized residual plot (STRESID versus SNOW), with the potentially influential observation marked
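The delete-and-refit check used in Example 16.7 is easy to carry out by computer. The sketch below refits the line with and without the observation that has the most extreme x value and expresses the change in slope in units of $s_b$. The function `fit_line` and the variable names are illustrative choices, not the authors' code.

```python
import math

# Eurasian snow cover (x) and winter temperature (y) from Example 16.7.
snow = [13.00, 12.75, 16.70, 18.85, 16.60, 15.35, 13.90,
        22.40, 16.20, 16.70, 13.65, 13.90, 14.75]
temp = [-13.5, -15.7, -15.5, -14.7, -16.1, -14.6, -13.4,
        -18.9, -14.8, -13.6, -14.0, -12.0, -13.5]


def fit_line(x, y):
    """Return (b, s_b, r2) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    return b, s_e / math.sqrt(s_xx), 1 - ss_resid / ss_total


b_all, s_b_all, r2_all = fit_line(snow, temp)

# Delete the observation farthest to the right, (22.40, -18.9), and refit.
idx = snow.index(max(snow))
snow_12 = snow[:idx] + snow[idx + 1:]
temp_12 = temp[:idx] + temp[idx + 1:]
b_12, s_b_12, r2_12 = fit_line(snow_12, temp_12)

# Change in slope, in units of the original estimated standard deviation of b.
change_in_sd = (b_all - b_12) / s_b_all

print(f"all 13 observations: b = {b_all:.3f}, s_b = {s_b_all:.3f}, r^2 = {r2_all:.2f}")
print(f"12 observations:     b = {b_12:.3f}, r^2 = {r2_12:.2f}")
print(f"change in slope, in standard deviations: {change_in_sd:.2f}")
```

Because the code works with unrounded intermediate values, the final ratio can differ slightly in the second decimal place from the hand calculation in the example, which uses rounded values of $b$ and $s_b$.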


Example 16.8 Treadmill Time and Ski Time

The paper “Physiological Characteristics and Performance of Top U.S. Biathletes” (Medicine and Science in Sports and Exercise [195]: 1302–1310) describes a study of the relationship between cardiovascular fitness (as measured by time to exhaustion running on a treadmill) and performance on a 20-kilometer ski race. Data on

x = treadmill time to exhaustion (in minutes)

and

y = 20-km ski time (in minutes)

for 11 athletes are shown in Table 16.2. Residuals and standardized residuals are also given. Is it reasonable to use the given data to construct a confidence interval or test hypotheses about $\beta$, the average change in ski time associated with a 1-min increase in treadmill time? It depends on whether two assumptions are reasonable: that the distribution of the deviations from the population regression line at any fixed x is approximately normal, and that the variance of this distribution does not depend on x. Constructing a normal probability plot of the standardized residuals and a standardized residual plot will provide insight into whether these assumptions are in fact reasonable.

TABLE 16.2 Data, Residuals, and Standardized Residuals for Example 16.8

Observation   Treadmill   Ski Time   Residual   Standardized Residual
1             7.7         71.0       0.172      0.10
2             8.4         71.4       2.206      1.13
3             8.7         65.0       -3.494     -1.74
4             9.0         68.7       0.906      0.44
5             9.6         64.4       -1.994     -0.96
6             9.6         69.4       3.006      1.44
7             10.0        63.0       -2.461     -1.18
8             10.2        64.6       -0.394     -0.19
9             10.4        66.9       2.373      1.16
10            11.0        62.6       -0.527     -0.27
11            11.7        61.7       0.206      0.12

Figure 16.17 shows a normal probability plot of the standardized residuals and a standardized residual plot. The normal probability plot is quite straight, and the standardized residual plot does not show evidence of any patterns or of increasing spread.

FIGURE 16.17 Plots for Example 16.8: (a) normal probability plot of standardized residuals (standardized residual versus normal score); (b) standardized residual plot (standardized residual versus treadmill time)

AP* EXAM TIP
Don’t forget to check assumptions. If you are used to checking assumptions before doing much in the way of calculation, it is sometimes easy to forget to check them in a regression setting. Be sure to step back and think about whether the four basic assumptions of the linear regression model are reasonable before making inferences about the population slope or using the estimated model to make predictions.




Example 16.9 A New Pediatric Tracheal Tube

The article “Appropriate Placement of Intubation Depth Marks in a New Cuffed, Paediatric Tracheal Tube” (British Journal of Anaesthesia [2004]: 80–87) describes a study of the use of tracheal tubes in newborns and infants. Newborns and infants have small tracheas, and there is little margin for error when inserting tracheal tubes. Using X-rays of a large number of children aged 2 months to 14 years, the researchers examined the relationships between appropriate tracheal tube insertion depth and other variables such as height, weight, and age. A scatterplot and a standardized residual plot constructed using data on the insertion depth and height of the children (both measured in cm) are shown in Figure 16.18.

Residual plots like the one pictured in Figure 16.18(b) are desirable. No point lies much outside the horizontal band between −2 and 2 (so there are no unusually large residuals corresponding to outliers). There is no point far to the left or right of the others (no observation that might be influential), and there is no pattern of curvature or differences in the variability of the residuals for different height values to indicate that the model assumptions are not reasonable.

But consider what happens when the relationship between insertion depth and weight is examined. A scatterplot of insertion depth and weight (kg) is shown in Figure 16.19(a), and a standardized residual plot in Figure 16.19(b). While some curvature is evident in the original scatterplot, it is even more clearly visible in the standardized residual plot. A careful inspection of these plots suggests that along with curvature, the residuals may be more variable at larger weights. When plots have this curved appearance and increasing variability in the residuals, the linear regression model is not appropriate.

FIGURE 16.18 (a) Scatterplot for insertion depth vs. height data of Example 16.9; (b) standardized residual plot (standardized residual vs. height).

FIGURE 16.19 (a) Scatterplot for insertion depth vs. weight data of Example 16.9; (b) standardized residual plot (standardized residual vs. weight).


Unless otherwise noted, all content on this page is © Cengage Learning.

Example 16.10 Looking for Love in All the Right... Trees?

Treefrogs’ search for mating partners was examined in the article “The Cause of Correlations Between Nightly Numbers of Male and Female Barking Treefrogs (Hyla gratiosa) Attending Choruses” (Behavioral Ecology [2002]: 274–281). A lek, in the world of animal behavior, is a cluster of males gathered in a relatively small area to exhibit courtship displays. The “female preference” hypothesis asserts that females will prefer larger leks over smaller leks, presumably because there are more males to choose from. The scatterplot and residual plot in Figure 16.20 show the relationship between the number of females and the number of males in observed leks of barking treefrogs. You can see that the unequal variance, which is noticeable in the scatterplot, is even more evident in the residual plot. This indicates that the assumptions of the linear regression model are not reasonable in this situation.

FIGURE 16.20 (a) Scatterplot for treefrog data of Example 16.10 (number of females vs. number of males); (b) residual plot (residuals vs. number of males)

Each exercise set assesses the following chapter learning objectives: M2, M7

Section 16.3 Exercise Set 1

16.30 The following graphs are based on data from an experiment to assess the effects of logging on a squirrel population in British Columbia (“Effects of Logging Pattern and Intensity on Squirrel Demography,” The Journal of Wildlife Management [2007]: 2655–2663). Plots of land, each nine hectares in area, were subjected to different percentages of logging, and the squirrel population density for each plot was measured after 3 years. The scatterplot, residual plot, and a boxplot of the residuals are shown here.

[Figure: scatterplot of squirrels per plot vs. %Logged; residual plot (residual vs. %Logged); boxplot of the residuals]

Does it appear that the assumptions of the simple linear regression model are plausible? Explain your reasoning in a few sentences.

16.31 The clutch size (number of eggs laid) for turtles is known to be influenced by body size, latitude, and average environmental temperature. Researchers gathered data on Gopher tortoises in Okeeheelee County Park in Florida to further understand the factors that affect reproduction in these animals (“Geographic Variation in Body and Clutch Size of Gopher Tortoises,” Copeia [2007]: 355–363). The scatterplot, residual plot, and a normal probability plot of the residuals for the least squares regression line with x = body length and y = clutch size are shown here.

[Figure: scatterplot of clutch size vs. length (mm); residual plot; normal quantile plot of the residuals]

Does it appear that the assumptions of the simple linear regression model are plausible? Explain your reasoning in a few sentences.

16.32 Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical analysis of diesel engine exhaust, x = mass (μg/cm²) and y = elemental carbon (μg/cm²) were recorded (“Comparison of Solvent Extraction and Thermal Optical Carbon Analysis Methods: Application to Diesel Vehicle Exhaust Aerosol,” Environmental Science Technology [1984]: 231–234). The estimated regression line for this data set is $\hat{y} = 31 + 0.737x$. The accompanying table gives the observed x and y values and the corresponding standardized residuals.

x        y      St. resid.
164.2    181    2.52
156.9    156    0.82
109.8    115    0.27
111.4    132    1.64
87.0     96     0.08
161.8    170    1.72
230.9    193    -0.73
106.5    110    0.05
97.6     94     -0.77
79.7     77     -1.11
118.7    106    -1.07
248.8    204    -0.95
102.4    98     -0.73
64.2     76     -0.20
89.4     89     -0.68
108.1    102    -0.75
89.4     91     -0.51
76.4     97     0.85
131.7    128    0.00
100.8    88     -1.49
78.9     86     -0.27
387.8    310    -0.89
135.0    141    0.91
82.9     90     -0.18
117.9    130    1.05

a. Construct a standardized residual plot. Are there any unusually large residuals? Do you think that there are any influential observations?

b. Is there any pattern in the standardized residual plot that would indicate that the simple linear regression model is not appropriate?

c. Based on your plot in Part (a), do you think that it is reasonable to assume that the variance of y is the same at each x value? Explain.

16.33 The article “Vital Dimensions in Volume Perception: Can the Eye Fool the Stomach?” (Journal of Marketing Research [1999]: 313–326) gave the accompanying data on the dimensions of 27 representative food products (Gerber baby food, Cheez Whiz, Skippy Peanut Butter, and Ahmed’s tandoori paste, to name a few).

Product   Maximum Width (cm)   Minimum Width (cm)
1         2.50                 1.80
2         2.90                 2.70
3         2.15                 2.00
4         2.90                 2.60
5         3.20                 3.15
6         2.00                 1.80
7         1.60                 1.50
8         4.80                 3.80
9         5.90                 5.00
10        5.80                 4.75
11        2.90                 2.80
12        2.45                 2.10
13        2.60                 2.20
14        2.60                 2.60
15        2.70                 2.60
16        3.10                 2.90
17        5.10                 5.10
18        10.20                10.20
19        3.50                 3.50
20        2.70                 1.20
21        3.00                 1.70
22        2.70                 1.75
23        2.50                 1.70
24        2.40                 1.20
25        4.40                 1.20
26        7.50                 7.50
27        4.25                 4.25

a. Fit the simple linear regression model that would allow prediction of the maximum width of a food container based on its minimum width.

b. Calculate the standardized residuals (or just the residuals if you don’t have access to a computer program that gives standardized residuals) and make a residual plot to determine whether there are any outliers.

c. The data point with the largest residual is for a 1-liter Coke bottle. Delete this data point and refit the regression. Did deletion of this point result in a large change in the equation of the estimated regression line?

d. For the regression line of Part (c), interpret the estimated slope and, if appropriate, the intercept.

e. For the data set with the Coke bottle deleted, do you think that the assumptions of the simple linear regression model are reasonable? Give statistical evidence for your answer.

16.34 Models of climate change predict that global temperatures and precipitation will increase in the next 100 years, with the largest changes occurring during winter in northern latitudes. Researchers gathered data on the potential effects of climate change for flowering plants in Norway (“Climatic Variability, Plant Phenology, and Northern Ungulates,” Ecology [1999]: 1322–1339). The table below gives data for one flower species. Range of flowering dates and elevation for different sites in Norway were used to construct the given scatterplot. A potentially influential point is indicated on the scatterplot.

Flowering Range versus Elevation: Tussilago farfara

Elevation (Meters Above Sea Level)   Flowering Date Range
23.3                                 33.4
5.6                                  32.0
55.6                                 31.9
140.0                                31.3
31.1                                 28.1
112.2                                29.3
106.7                                28.4
42.2                                 26.6
75.6                                 24.9
176.7                                25.7
126.7                                24.7
126.7                                23.5
176.7                                23.2
201.1                                21.8
133.3                                22.3
90.0                                 21.4
41.1                                 19.7
125.6                                17.6
477.8                                17.6

[Figure: Bivariate Fit of Flowering Date Range by Elevation; scatterplot of flowering date range vs. elevation, with the potentially influential point indicated]

a. Fit a linear regression model using all 19 observations. What are the values of $a$, $b$, $r^2$, and $s_e$?

b. Fit a linear regression model with the indicated point omitted. What are the values of $a$, $b$, $r^2$, and $s_e$?

c. In a few sentences, describe any differences you found in Parts (a) and (b).

d. The researchers could use the estimated regression equation based on all 19 observations to make predictions for elevations ranging from 0 to 200 meters; or they could use the estimated regression equation based on the 18 observations (omitting the observation identified by an arrow) to make predictions for elevations ranging from 0 to 500 meters. Which strategy would you recommend, and why?

16.37 The accompanying scatterplot, based on 34 sediment samples with x = sediment depth (cm) and y = oil and grease content (mg/kg), appeared in the article “Mined Land Reclamation Using Polluted Urban Navigable Waterway Sediments” (Journal of Environmental Quality [1984]: 415–422). Discuss the effect that the observation (20, 33,000) will have on the estimated regression line. If this point were omitted, what do you think will happen to the slope of the estimated regression line compared to the slope when this point is included?


16.35 In the study described in Exercise 16.31, the effect of latitude on mean clutch size was investigated. Data on y = mean clutch size and x = latitude were collected at various locations in Florida, Georgia, Alabama, and Mississippi. The scatterplot, standardized residual plot, and several graphs of the standardized residuals are shown below.

[Figures: scatterplot of mean clutch size versus latitude; standardized residual plot versus latitude; histogram of the residuals; normal probability plot of the residuals.]

16.36 Exercise 6.21 gave data on x = nerve firing frequency and y = pleasantness rating when nerves were stimulated by a light brushing stroke on the forearm. The x values and the corresponding standardized residuals from a simple linear regression are as follows:

Firing Frequency, x   Standardized Residual
        23                  -1.83
        24                   0.04
        22                   1.45
        25                   0.20
        27                  -1.07
        28                   1.19
        34                  -0.24
        33                  -0.13
        36                  -0.81
        34                   1.17

a. Construct a standardized residual plot. Does the plot exhibit any unusual features?

b. A normal probability plot of the standardized residuals follows. Based on this plot, do you think it is reasonable to assume that the error distribution is approximately normal? Explain.

[Figure: normal probability plot of the standardized residuals (standardized residual versus normal score).]
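For Exercise 16.36, the pairs plotted in a normal probability plot can be computed as sketched below. This is an illustration under a stated assumption: the normal scores come from Blom's approximation, (i − 0.375)/(n + 0.25), which is one common convention; statistical packages may use slightly different scores. The function names are hypothetical.

```python
import math
from statistics import NormalDist

# Standardized residuals from the table in Exercise 16.36
std_resid = [-1.83, 0.04, 1.45, 0.20, -1.07, 1.19, -0.24, -0.13, -0.81, 1.17]

def normal_scores(n):
    """Approximate expected normal order statistics (Blom's formula, an assumption)."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

ordered = sorted(std_resid)           # smallest to largest residual
scores = normal_scores(len(ordered))  # matching normal scores

for z, r in zip(scores, ordered):
    print(f"normal score {z:6.2f}   standardized residual {r:6.2f}")

# A correlation close to 1 corresponds to a nearly straight plot.
print(f"correlation = {pearson(scores, ordered):.3f}")
```

A pattern that is close to a straight line supports the assumption of an approximately normal error distribution; strong curvature or isolated points far from the line would argue against it.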



[Figure for Exercise 16.37: scatterplot of oil and grease (mg/kg) versus subsample mean depth (cm).]

16.38 Investigators in northern Alaska periodically monitored radio-collared wolves in 25 wolf packs over 4 years, keeping track of the packs’ home ranges (“Population Dynamics and Harvest Characteristics of Wolves in the Central Brooks Range, Alaska,” Wildlife Monographs [2008]: 1–25). The home range of a pack is the area typically covered by its members in a specified amount of time. The investigators noticed that wolf packs with larger home ranges tended to be located more often by monitoring equipment. The investigators decided to explore the relationship between home range and the number of locations per pack. A scatterplot and standardized residual plot of the data are shown below, as well as plots of the standardized residuals.


[Figures: scatterplot of home range versus locations per pack; standardized residual plot versus locations per pack; histogram of the standardized residuals.]


Additional Exercises

16.39 Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical analysis of diesel engine exhaust, x = mass (mg/cm²) and y = elemental carbon (mg/cm²) were recorded (“Comparison of Solvent Extraction and Thermal Optical Carbon Analysis Methods: Application to Diesel Vehicle Exhaust Aerosol,” Environmental Science Technology [1984]: 231–234). The estimated regression line for this data set is

ŷ = 31 + 0.737x.

A scatterplot of the data and a standardized residual plot are shown below.

[Figures: Bivariate Fit of carbon By mass, a scatterplot of carbon versus mass; standardized residual plot of standardized residuals versus mass.]


a. Are there any unusually large residuals? Do you think that there are any influential observations?

b. Is there any pattern in the standardized residual plot that would indicate that the simple linear regression model is not appropriate?

c. Based on the scatterplot and the standardized residual plot, do you think that it is reasonable to assume that the variance of y is the same at each x value? Explain.

16.40 Models of climate change predict that global temperatures and precipitation will increase in the next 100 years, with the largest changes occurring during winter in northern latitudes. Researchers recently gathered data on the potential effects of climate change for flowering plants in Norway (“Climatic Variability, Plant Phenology, and Northern Ungulates,” Ecology [1999]: 1322–1339). The table below gives data for one flower species. A scatterplot of the “range of flowering dates” versus latitude for different sites in Norway is also shown. Two points that are potentially influential are indicated on the scatterplot.

[Figure: scatterplot of mean flowering date range versus latitude (N), with two potentially influential points indicated.]

Flowering Range Versus Latitude: Anemone hepatica

Latitude (N)   Flowering Date Range
    58.7              46.1
    58.2              35.9
    58.2              34.7
    59.4              32.3
    60.0              33.0
    59.4              29.7
    59.1              26.9
    59.3              26.2
    59.5              25.6
    59.5              27.6
    59.7              19.1
    59.8              24.4
    60.8              26.2
    60.9              26.8
    63.4              28.7
    63.4              19.2
    60.5              22.5
    60.7              17.9
    60.7              12.9
    61.1              11.8

a. Fit a linear regression model using all 20 observations. What are the values of a, b, r², and s_e?

b. Fit a linear regression model with the two observations identified by arrows omitted. What are the values of a, b, r², and s_e?

c. In a few sentences, describe any differences you found in Parts (a) and (b).

d. The researchers could use the estimated regression equation based on all 20 observations to make predictions for latitudes ranging from 58 to 64, or they could use the estimated regression equation based on the 18 observations (omitting the two observations identified by arrows) to make predictions for latitudes ranging from 58 to 62. Which strategy would you recommend, and why?

16.41 The sand scorpion is a predator that always hunts from a motionless resting position outside its own burrow. When prey appears on the horizon, within say 20 cm, the scorpion assumes an alert posture; it determines the angular position of the prey, makes a quick rotation, and runs after it. In a recent study of the scorpion’s accuracy, the angular position (0 degrees = right in front) of the prey and the turning angle of the scorpion were recorded for 23 attacks. A simple regression model relating the response angle of the predator to the target angle position of the prey,

r = a + b(t), was fit.

The resulting residual plot is shown. Describe the locations of any outliers you see in the residual plot.

[Figure: residual plot of residual versus target angle.]
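For Exercise 16.40, the comparison asked for in Parts (a) and (b) can be sketched as follows. This is only an illustration: the function name fit_summary is hypothetical, and the code assumes that the two points flagged on the scatterplot are the two sites at latitude 63.4, which is consistent with the ranges given in Part (d) but is not stated in the exercise.

```python
import math

def fit_summary(x, y):
    """Return a, b, r2, and se for the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b = sxy / sxx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, 1 - ss_resid / syy, math.sqrt(ss_resid / (n - 2))

latitude = [58.7, 58.2, 58.2, 59.4, 60.0, 59.4, 59.1, 59.3, 59.5, 59.5,
            59.7, 59.8, 60.8, 60.9, 63.4, 63.4, 60.5, 60.7, 60.7, 61.1]
flower_range = [46.1, 35.9, 34.7, 32.3, 33.0, 29.7, 26.9, 26.2, 25.6, 27.6,
                19.1, 24.4, 26.2, 26.8, 28.7, 19.2, 22.5, 17.9, 12.9, 11.8]

# Part (a): all 20 observations
all_fit = fit_summary(latitude, flower_range)

# Part (b): assumption -- the flagged points are the two sites at latitude 63.4
kept = [(x, y) for x, y in zip(latitude, flower_range) if x < 63]
trimmed_fit = fit_summary([x for x, _ in kept], [y for _, y in kept])

for label, (a, b, r2, se) in [("all 20", all_fit), ("18 kept", trimmed_fit)]:
    print(f"{label}: a = {a:.2f}, b = {b:.3f}, r2 = {r2:.3f}, se = {se:.2f}")
```

Comparing the two printed lines shows how much the two high-latitude sites pull on the slope and on the measures of fit.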




16.42 The production of pups and their survival are the most significant factors contributing to gray wolf population growth. The causes of early pup mortality are unknown and difficult to observe. The pups are concealed within their dens for 3 weeks after birth, and after they emerge it is difficult to confirm their parentage. Researchers recently used portable ultrasound equipment to investigate some factors related to reproduction (“Diagnosing Pregnancy, In Utero Litter Size, and Fetal Growth with Ultrasound in Wild, Free-Ranging Wolves,” Journal of Mammalogy [2006]: 85–92).

A scatterplot and linear regression of the length of a wolf fetus (in cm, measured from crown to rump) and gestational age (in days) is shown below. Identify the point that has the largest residual by giving its approximate coordinates.

[Figure: scatterplot with fitted line of crown-rump length (cm) versus gestational age (days).]

16.43 The authors of the article “Age, Spacing and Growth Rate of Tamarix as an Indication of Lake Boundary Fluctuations at Sebkhet Kelbia, Tunisia” (Journal of Arid Environments [1982]: 43–51) used a simple linear regression model to describe the relationship between y = vigor (average width in centimeters of the last two annual rings) and x = stem density (stems/m²). The estimated model was based on the following data. Also given are the standardized residuals.

x            4      5      6      9     14
y         0.75   1.20   0.55   0.60   0.65
St. resid. -0.28  1.92  -0.90  -0.28   0.54

x           15     15     19     21     22
y         0.55   0.00   0.35   0.45   0.40
St. resid.  0.24  -2.05  -0.12   0.60   0.52

a. What assumptions are required for the simple linear regression model to be appropriate?

b. Construct a normal probability plot of the standardized residuals. Does the assumption that the random deviation distribution is normal appear to be reasonable? Explain.

c. Construct a standardized residual plot. Are there any unusually large residuals?

d. Is there anything about the standardized residual plot that would cause you to question the use of the simple linear regression model to describe the relationship between x and y?

Are You Ready to Move On? Chapter 16 Review Exercises

All chapter learning objectives are assessed in these exercises. The learning objectives assessed in each exercise are given in parentheses.

16.44 (C1) Describe what distinguishes a deterministic model from a probabilistic model.

16.45 (C2) In the context of the simple linear regression model, explain the difference between α and a, between β and b, and between σ_e and s_e.

16.46 (M1) The SAT and ACT exams are often used to predict a student’s first-term college grade point average (GPA). Different formulas are used for different colleges and majors. Suppose that a student is applying to State U with an intended major in civil engineering. Also suppose that for this college and this major, the following model is used to predict first-term GPA.

GPA = a + b(ACT)

a = 0.5

b = 0.1

a. In this context, what would be the appropriate interpretation of a?

b. In this context, what would be the appropriate interpretation of b?

16.47 (M2) Theropods were carnivorous dinosaurs, characterized by short forelimbs, living in the Jurassic and Cretaceous periods. (Tyrannosaurus rex is classified as a theropod.) What scientists know about theropods is based on studying incomplete skeletal remains. In a study described in the paper “My Theropod Is Bigger Than Yours…or Not: Estimating Body Size from Skull Length in Theropods” (Journal of Vertebrate Paleontology [2007]: 108–115), researchers used data from skeletons to develop a model describing the relationship between body length and skull length. JMP was used to produce the following graphical displays and computer output. When you evaluate the fit of an estimated regression line, all of the information below is considered as a whole. However, the summary statistics in the computer output and the different plots each convey some specific information.

[Figures: Bivariate Fit of BodyLength By SkullLength (scatterplot with fitted line); Bivariate Fit of Residuals By SkullLength (residual plot); normal quantile plot and boxplot of the residuals.]

Linear Fit
BodyLength = 0.7061088 + 7.791973*SkullLength

Summary of Fit
RSquare                       0.953929
RSquare Adj                   0.951218
Root Mean Square Error        0.801042
Mean of Response              5.859474
Observations (or Sum Wgts)    19

Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.7061088   0.330485     2.14     0.0475*
SkullLength   7.791973    0.415318    18.76     <.0001*

a. Using only the scatterplot, do you think a linear model does a good job of describing the relationship? Explain why or why not.

b. Using only the residual plot, what can you determine about whether the basic assumptions of the linear regression model are met?

c. Using only the normal probability plot and boxplot of the residuals, what can you determine about whether the basic assumptions of the linear regression model are met?

d. Using only the values of r² and s_e, what can you say about the quality of the fit of the linear model for these data?

16.48 (M3) There are four basic assumptions necessary for making inferences about β, the slope of the population regression line.

a. What are the four assumptions?

b. Which assumptions can be checked using sample data?

c. What statistics or graphs would be used to check each of the assumptions you listed in Part (b)?

16.49 (M3, M4, M5, P1, P2) Ruffed grouse are a species of birds that nest on the ground. Because of this, chick survival at night in the first few weeks of life depends on avoiding predators. Biologists have theorized that protection from predators might be supplied by the mother hen’s choice of brooding sites. One variable that biologists thought might be related to survival is the density of vegetation in the vicinity of the nest. Dense vegetation would possibly reduce the ability of predators to detect the nests. The paper “Nocturnal Roost Habitat Selection by Ruffed Grouse Broods” (Journal of Field Ornithology [2005]: 168–174) describes a study in which researchers monitored the survival of the brood (number of chicks surviving/number of eggs hatched) in 23 nests in different vegetation densities (thousands of stems/hectare).

Computer output (from JMP) is shown below.

Linear Fit
BroodSurvival = 0.9468008 - 0.0261902*StemDensity

Summary of Fit
RSquare                       0.193788
RSquare Adj                   0.155397
Root Mean Square Error        0.287538
Mean of Response              0.436043
Observations (or Sum Wgts)    23

Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.9468008   0.235108     4.03     0.0006*
StemDensity   -0.02619    0.011657    -2.25     0.0355*

a. Is there convincing evidence of a useful linear relationship between brood survival and stem density? Explain.

b. Would you describe the relationship as strong? Why or why not?

c. Construct a 95% confidence interval for β and interpret it in context.

d. What margin of error is associated with the confidence interval in Part (c)?
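The arithmetic behind Parts (c) and (d) of Exercise 16.49 can be sketched as below, using the slope estimate and standard error from the JMP output. The t critical value 2.080 (df = n − 2 = 21, 95% confidence) is taken from a t table and hard-coded here as an assumption, since Python's standard library has no t distribution; variable names are illustrative.

```python
# Values read from the JMP Parameter Estimates table for StemDensity
b = -0.02619      # estimated slope
s_b = 0.011657    # standard error of the slope
n = 23            # number of nests

df = n - 2        # degrees of freedom for inference about the slope
t_crit = 2.080    # t critical value for 95% confidence, df = 21 (from a t table)

margin = t_crit * s_b                 # margin of error
lower, upper = b - margin, b + margin

print(f"df = {df}")
print(f"margin of error = {margin:.4f}")
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

Whether the resulting interval includes 0 connects directly to the model utility question in Part (a).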

16.50 (M7) Researchers in Hawaii have recently documented a large increase in the prevalence of a bird parasite known as chewing lice (“Explosive Increase in Ectoparasites in Hawaiian Forest Birds,” The Journal of Parasitology [2008]: 1009–1021). Current data suggest that the prevalence of chewing lice may be less for bird species with a high degree of bill overhang. A species is said to have bill overhang when the upper bill extends downward in front of the end of the lower bill. The following scatterplot shows the relationship between the prevalence of chewing lice and bill overhang for 8 bird species in the Hawaiian Islands. A residual plot is also shown. Use these plots to identify any outliers or potentially influential observations. For each point you identify, assess its influence on the estimated slope of the regression line.

[Figures: scatterplot of lice prevalence versus bill overhang; residual plot of residual versus bill overhang.]

16.51 (M6) Suppose you are given the computer output shown. You want to test the hypothesis β = 1.0. Describe how you would use the computer output to test this hypothesis.

Linear Fit
y = 5.6452776 + 0.9797401*x

Summary of Fit
RSquare                       0.985289
RSquare Adj                   0.984954
Root Mean Square Error        12.48525
Mean of Response              0.791304
Observations (or Sum Wgts)    46

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   5.6452776   1.84302      3.06     0.0037*
x           0.9797401   0.018048    54.29     <.0001*
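The t ratio printed in the output for Exercise 16.51 tests a hypothesized slope of 0, so a test of a slope of 1.0 needs a statistic rebuilt from the estimate and its standard error. The sketch below shows that calculation. It assumes a two-sided alternative and a 0.05 significance level, neither of which is specified in the exercise, and uses 2.015 as the approximate t critical value for df = 44 from a t table.

```python
# Values read from the JMP Parameter Estimates table for x
b = 0.9797401        # estimated slope
s_b = 0.018048       # standard error of the slope
n = 46               # number of observations
hypothesized = 1.0   # hypothesized value of the population slope

df = n - 2
t_stat = (b - hypothesized) / s_b

# Assumption: two-sided test at the 0.05 level; 2.015 is the approximate
# t critical value for df = 44 taken from a t table.
t_crit = 2.015
reject = abs(t_stat) > t_crit

print(f"t = {t_stat:.2f} with df = {df}")
print("reject H0" if reject else "fail to reject H0")
```

The same pattern works for any hypothesized slope: subtract the hypothesized value from the estimate and divide by the standard error.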

Technology Notes

Regression Test

TI-83/84
1. Enter the data for the independent variable into L1 (to access lists, press the STAT key, highlight the option called Edit…, then press ENTER)
2. Enter the data for the dependent variable into L2
3. Press STAT
4. Highlight TESTS
5. Highlight LinRegTTest… and press ENTER
6. Next to β & ρ select the appropriate alternative hypothesis
7. Highlight Calculate

TI-Nspire
1. Enter the data into two separate data lists (to access data lists, select the spreadsheet option and press enter)
Note: Be sure to title the lists by selecting the top row of the column and typing a title.
2. Press the menu key and select 4:Stat Tests then 4:Stats Tests then A:Linear Reg t Test… and press enter
3. In the box next to X List choose the list title where you stored your independent data from the drop-down menu
4. In the box next to Y List choose the list title where you stored your dependent data from the drop-down menu
5. In the box next to Alternate Hyp choose the appropriate alternative hypothesis from the drop-down menu
6. Press OK

JMP
1. Input the data for the dependent variable into the first column
2. Input the data for the independent variable into the second column
3. Click Analyze and select Fit Y by X
4. Select the dependent variable (Y) from the box under Select Columns and click on Y, Response
5. Select the independent variable (X) from the box under Select Columns and click on X, Factor
6. Click the red arrow next to Bivariate Fit of… and select Fit Line

Minitab
1. Input the data for the dependent variable into the first column
2. Input the data for the independent variable into the second column
3. Select Stat then Regression then Regression…
4. Highlight the name of the column containing the dependent variable and click Select
5. Highlight the name of the column containing the independent variable and click Select
6. Click OK

Note: You may need to scroll up in the Session window to view the t-test results for the regression analysis.

SPSS
1. Input the data for the dependent variable into one column
2. Input the data for the independent variable into a second column
3. Click Analyze then click Regression then click Linear…
4. Select the name of the dependent variable and click the arrow to move the variable to the box under Dependent:
5. Select the name of the independent variable and click the arrow to move the variable to the box under Independent(s):
6. Click OK

Note: The p-value for the regression test can be found in the Coefficients table in the row with the independent variable name.

Excel
1. Input the data for the dependent variable into the first column
2. Input the data for the independent variable into the second column
3. Select Analyze then choose Regression then choose Linear…
4. Highlight the name of the column containing the dependent variable
5. Click the arrow button next to the Dependent box to move the variable to this box
6. Highlight the name of the column containing the independent variable
7. Click the arrow button next to the Independent box to move the variable to this box
8. Click OK

Note: The test statistic and p-value for the regression test for the slope can be found in the third table of output. These values are listed in the row titled with the independent variable name and the columns entitled t Stat and P-value.
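For readers without access to the packages above, the quantities reported by a regression test can also be computed directly. The sketch below is a plain Python illustration, not a description of how any of those packages work internally. The function name slope_t_test and the small data set are hypothetical, and the p-value is left out because it requires a t distribution, which Python's standard library does not provide.

```python
import math

def slope_t_test(x, y):
    """Return b, s_b, the t statistic for a hypothesized slope of 0, and df."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # estimated slope
    a = y_bar - b * x_bar              # estimated intercept
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ss_resid / (n - 2)) # standard deviation about the line
    s_b = se / math.sqrt(sxx)          # standard error of the slope
    return b, s_b, b / s_b, n - 2

# Hypothetical data, used only to show the call
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

b, s_b, t_stat, df = slope_t_test(x, y)
print(f"b = {b:.3f}, s_b = {s_b:.4f}, t = {t_stat:.2f}, df = {df}")
```

The t statistic and df can then be compared with a t table to judge the strength of evidence against a population slope of 0.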


AP* Review Questions for Chapter 16

Use the following information for questions 1–6.

A study was carried out to investigate the relationship between x = the number of components needing repair and y = the time of the service call (in minutes) for a computer repair company. The number of components and the service time for a random sample of 20 service calls were used to fit a simple linear regression model. Partial computer output is shown below.

The regression equation is
Time = 37.2 + 9.97 Number

Predictor   Coef     SE Coef   T       P
Constant    37.213   7.985     4.66    0.000
Number      9.9695   0.7218    13.81   0.000

S = 18.7534   R-Sq = 89.7%   R-Sq(adj) = 89.2%

1. Which of the following statements is a correct interpretation of the value 9.97?

(A) The average number of components needing repair goes up 9.97 for each 1 minute increase in the service time of a call.

(B) On average, the service call time goes up 9.97 minutes for each additional component needing repair.

(C) The service call time is 9.97 minutes when there are 0 components to repair.

(D) Approximately 9.97% of the observed variation in the service call times can be explained by the linear relationship between service time and number of components requiring repair.

(E) If this regression equation is used to predict service call times, we can expect predictions to be within 9.97 minutes of the actual time.

2. Which of the following statements is a correct interpretation of the value 89.7%?

(A) On average, the service call time goes up 89.7 minutes for each additional component needing repair.

(B) The magnitude of a typical difference between an observed service call time and the service call time predicted by the linear model is approximately 89.7 minutes.

(C) The correlation between service call time and number of components needing repair is 89.7%.

(D) Approximately 89.7% of the observed variation in service call time can be explained by the linear relationship between service call time and number of components needing repair.

(E) If this regression equation is used to predict service call times, we can expect predictions to be within 89.7 minutes of the actual time.

3. The value of s_e is 18.75. Which of the following is an appropriate interpretation of this value?

(A) 18.75% of the variability in service time can be explained by the linear relationship between service call time and number of components needing repair.

(B) There is a positive correlation between service call time and number of components needing repair.

(C) For every 1-component increase in the number of components needing repair, the predicted service call time increases by about 18.75 minutes.

(D) The magnitude of a typical difference between an observed service time and the service call time predicted by the linear model is approximately 18.75 minutes.

(E) The average service call time is 18.75 minutes.

4. The value of s_e is 18.75. If the assumptions of the simple linear regression model are satisfied, which of the following is correct?

(A) The width of a 95% confidence interval for the slope of the population regression line is 2(18.75) = 37.50.

(B) It would be unlikely that a prediction based on the regression line will be greater than 18.75 minutes.

(C) It would be unlikely that a prediction based on the regression line will differ from the actual value by more than 2(18.75) = 37.50 minutes.

(D) Errors associated with predictions based on the regression line will always be less than 18.75 minutes.

(E) The value of s_e does not provide any information about the anticipated magnitude of prediction errors.

5. Which of the following is a 95% confidence interval for the change in service time associated with a 1-unit increase in the number of components needing repair?

(A) 37.21 ± (1.96)(7.985)
(B) 37.21 ± (2.910)(7.985)
(C) 9.97 ± (1.96)(0.7218)
(D) 9.97 ± (2.10)(0.7218)
(E) 9.97 ± 2(18.7534)

6. If the basic assumptions of the simple linear regression model are reasonable, what conclusion should be reached regarding model utility if a significance level of 0.05 is used for the model utility test?

(A) There is convincing evidence of a negative linear relationship between service call time and number of components needing repair.

(B) There is convincing evidence that the model is not useful for predicting service call time.

(C) There is convincing evidence that the model is useful for predicting service call time.

(D) There is not convincing evidence that the model is useful for predicting service call time.

(E) A conclusion cannot be reached based on the given information.

AP* and the Advanced Placement Program are registered trademarks of the College Entrance Examination Board, which was not involved in the production of, and does not endorse, this product.


7. If there is a positive linear relationship between two variables x and y, which of the following must be true of β, the slope of the population regression line?

(A) β < 0
(B) β > 0
(C) β = 0
(D) β > 1
(E) -1 < β < 1

8. The plots shown are residual plots resulting from fitting a linear regression. Which of these plots indicates that the relationship between the two variables used to fit the line may not be linear?

[Figure: three residual plots, labeled I, II, and III, each plotted against x.]

(A) I only
(B) II only
(C) III only
(D) I and III only
(E) II and III only

Use the scatterplot below to answer questions 9 and 10.

[Figure: scatterplot of y versus x with four labeled points A, B, C, and D.]

9. Which of the labeled points would have the largest residual when a linear model is fit to the data?

(A) A
(B) B
(C) C
(D) D
(E) Both C and D

10. Which of the labeled points corresponds to a potentially influential observation if a linear model is to be fit to the data?

(A) A
(B) B
(C) C
(D) D
(E) Both C and D

11. If there is evidence of a linear relationship between x and y, what decision will be made in a test of H0: β = 0 versus Ha: β ≠ 0?

(A) Reject H0 and conclude that there is no evidence that the linear model is useful
(B) Reject H0 and conclude that there is evidence that the linear model is useful
(C) Fail to reject H0 and conclude that there is no evidence that the linear model is useful
(D) Fail to reject H0 and conclude that there is evidence that the linear model is useful
(E) Not enough information to say.


Use the following information to answer questions 12 and 13.

As part of a study of the swimming speed of sharks, a random sample of 18 lemon sharks (Triakis semifasciata) were observed in a laboratory sea tunnel. Body lengths and maximum sustainable swimming speeds (“MSSS,” reported in body lengths per second) were measured for each shark. The computer output from a regression with y = MSSS and x = body length is given below.

Linear Fit
MSSS = 1.8928955 - 0.0104278*Length

Summary of Fit
RSquare             0.526395
RSquare Adj         0.496794
S                   0.272031
Mean of Response    1.24
N Observations      18

Source   DF   Sum of Squares   Mean Square   F Ratio
Model     1   1.3159870        1.31599       17.7834
Error    16   1.1840130        0.07400       Prob > F
Total    17   2.5000000                      0.0007*

Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    1.8928955   0.167575    11.30     <.0001*
Length(cm)   -0.010428   0.002473    -4.22     0.0007*

12. For this data set, the model utility test is based on how many degrees of freedom?

(A) 15
(B) 16
(C) 17
(D) 18
(E) 19

13. What is the P-value associated with the model utility test?

(A) 0.0001
(B) 0.0007
(C) 0.07400
(D) 0.167575
(E) 0.526395

14. Which of the following is not an assumption that is made about the random deviation e in a simple linear regression model?

(A) The distribution of e is normal.
(B) The standard deviation of e, σ_e, depends upon the particular value of x.
(C) The mean value of e is 0.
(D) The random deviations, e_1, e_2, …, e_n, associated with different observations are independent of one another.
(E) The standard deviation of e, σ_e, is the same for each x value.

15. The residual plot below indicates that one or more of the assumptions of the linear regression model may not be met. Which of the following is a reasonable conclusion based on this residual plot?

[Figure: residual plot of residuals versus x.]

(A) The residual plot clearly indicates a non-linear model would be more appropriate.
(B) There is evidence that the residuals are not normally distributed.
(C) The slope of the regression line is non-zero.
(D) The correlation between x and y is non-zero.
(E) There is evidence the residuals do not have the same variance for all x values.
