
Chapter 13

Linear Regression


13.1 Descriptive Statistics using Linear Regression

In this chapter we will look at the relationship between two different variables. They typically measure different things, for example heights versus weights, and they usually have different units, as will be the case for a lot of regression problems. Since we have two variables, we need to distinguish them. Just as when you looked at graphs in a previous course, we call them x and y. The variable x is called the independent variable or explanatory variable. The variable y is the dependent variable. It ‘depends’ on x, and x ‘explains’ the changes in y.

As we continue to look at regression we need to distinguish an observational study from an experiment. In order for us to do an experiment we need to be able to change one variable and then see what happens to the other variable. For many situations, this is something we cannot do. For example, if we are looking at height vs. weight, we can’t change someone’s height and see what then happens to their weight. Instead we observe several people’s heights and weights. We are looking at ‘correlation, not causation’. Even if you think that one of the variables causes the other variable to change, the only way to really determine causation is with an experiment.

We will limit our attention to linear regression. Why linear? There are two good answers: first, it is a relatively easy model; second, a lot of relationships are linear on a small interval even if they are nonlinear globally.

We are going to make some assumptions up front and we will proceed from there. First of all, we will assume that the data have a general linear relationship given by y = A + Bx (see footnote 1). Most students’ experience with models of this form assumed the model was exact. That is, for a given value of x you would plug the value into the equation and get y, and this was the ‘answer’. We are now introducing variability into the situation. Consider the relationship between heights and weights of adult Americans. If we were given an equation of the form y = A + Bx, we certainly wouldn’t expect that plugging a person’s height into the equation would give exactly that person’s weight. We would expect to be close, but not to get the exact weight. If we look at the weights of all Americans that are, say, 5′7′′ tall, their weights would follow a distribution.

[Figure: the line y = A + Bx, with a normal distribution of y values drawn at several values of x.]

In the diagram, the line y = A + Bx is indicated. A few distributions are shown for different values of x. Notice that the distributions are normal, each mean is on the line, and the standard deviations are all the same. We can put this in the following form:

y = A + Bx + ε

where

ε ∼ N(0, σε)

Footnote 1: Some texts use y = α + βx.

13.1.1 Scatter Plots

Just as we looked at graphical displays when we first started looking at data, we will do the same here. Consider the picture in the next example. If we randomly select a person who is 5 feet tall, their weight would fall somewhere on the vertical line above 5 feet on the x-axis. Likewise for any other heights we happen to get. If we look at all of these points together we get a scatter plot. We will examine this in the next example.

Example 13.1.1.

A random sample of 8 recently recruited soldiers is selected after finishing boot camp. Their heights and weights are in the table below. Use the heights as the independent variable and draw a scatter plot. Discuss whether the required assumptions are reasonable in this case.

Height, inches    60   68   73   66   64   69   71   63
Weight, pounds   105  137  195  159  134  184  201  134

Solution.

To graph the scatter plot we need to plot the ordered pairs on a pair of axes. Since this is a visual display we need to include information for the reader: scales, titles, etc.

When we examine the graph, we notice an overall linear trend. That is not to say the points lie on a line; clearly they do not. They do, however, follow a line, more or less. Also note that if we examine the scatter plot and imagine a line of ‘best fit’ passing through the data values, the points will be evenly distributed on either side of the line.


[Figure: scatter plot “Height and Weight of Boot Camp Graduates”; x-axis: Height, inches; y-axis: Weight, pounds.]

13.1.2 Finding the Line of Best Fit

We would like to estimate the equation of the line that the data follow. As with earlier estimates, we will have point estimates.

The best fitting line is graphed along with the scatter plot below.

[Figure: the scatter plot with the best-fitting line drawn through it; x-axis: Height, inches; y-axis: Weight, pounds.]



The term ‘best fit’ needs to be defined. After all, what does ‘best’ mean? Let us consider our example above. Clearly, there should be a relationship between height and weight. The taller a person is, the more they weigh. (Remember, we are talking trends here.) We would ultimately like to estimate someone’s weight simply by knowing their height.

In the next graph let us look at the distance from each point to the line.

[Figure: the scatter plot and best-fitting line, with a vertical line segment drawn from each point to the line; x-axis: Height, inches; y-axis: Weight, pounds.]


The lengths of the line segments represent what are called the residuals. If you look at the largest of the residuals you would describe this person as underweight. For this person’s height, 68 inches, the line predicts about 165 pounds but their actual weight is 137 pounds.

If we call the residuals e (our sample version of ε), then we want to find a and b so that Σe² is minimized (see footnote 2). This is why we often refer to the line as the ‘least squares’ regression line.
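For readers curious about what the calculator is doing, the minimization has a well-known closed-form solution. It is a standard result that this text does not derive; the bars denote sample means and the index i runs over the n data pairs, notation introduced here only for illustration.

```latex
\text{minimize}\quad \sum_{i=1}^{n} e_i^2 \;=\; \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2
\qquad\Longrightarrow\qquad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```

One consequence worth noticing: because a = ȳ − b x̄, the least squares line always passes through the point (x̄, ȳ).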

Example 13.1.2.

Find the equation of the regression line for the data in the previous example. Use the equation to predict the weight of a 65 inch tall recent boot camp graduate.

Solution.

As mentioned before, we will rely on our calculator to find this. On a TI-83/84, input the values into L1 and L2.

STAT>CALC>LinReg(a+bx) (option 8)

Note that we want option 8, not option 4, which is ax+b.

Footnote 2: This is a rather straightforward problem for someone who has taken calculus. We will rely on our calculator for the values.


Once we have selected the correct option, specify where the lists are: the x-list is in L1 and the y-list is in L2.

You should get the following output

LinReg
y=a+bx
a=-330.7946768
b=7.294676806

You may also have r² and r. More on that later.

Our equation is y = −330.79 + 7.295x.

Note that this equation is only applicable to the population in question: all recent boot camp graduates.

Let us evaluate the equation at x = 65: y = −330.79 + 7.295 × 65 = 143.385. A reasonable way to round would be to the nearest pound, so our estimate of the weight of a recent graduate who is 65 inches tall is 143 pounds. As mentioned before, this is a point estimate. Interval estimates are on their way.
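The calculator output can be cross-checked in a few lines of Python. This is only a sketch: it assumes the standard closed-form least-squares formulas (slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − b·x̄), which the text leaves to the calculator, and the function name `least_squares` is made up for illustration.

```python
# Least-squares fit for the boot camp data of Example 13.1.1.
heights = [60, 68, 73, 66, 64, 69, 71, 63]          # x, inches
weights = [105, 137, 195, 159, 134, 184, 201, 134]  # y, pounds


def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of e^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


a, b = least_squares(heights, weights)
predicted = a + b * 65  # point estimate for a 65 inch tall graduate

print(f"a = {a:.7f}, b = {b:.9f}")
print(f"predicted weight at 65 inches: {predicted:.1f} pounds")
```

Because the script keeps the unrounded coefficients, its prediction differs slightly from the hand calculation with rounded values, but it rounds to the same whole number of pounds.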

13.1.3 Interpreting a and b

If we recall from algebra, −330.79 represents the y-intercept and 7.295 represents the slope. We want to interpret these in terms of the problem. Recall that we get the y-intercept by setting x equal to 0. In this particular case, this means that a person who is 0 inches tall is expected to weigh about −331 pounds. This makes no sense. What this is telling us is that we are pushing our model way beyond reasonable values of x. In many situations, the y-intercept will have no meaning. Often, 0 is so far from the observed values of x that the intercept is not reliable as an estimate.

Now let’s look at 7.295. This is the slope. Remember that

$$\text{slope} = \frac{\text{rise}}{\text{run}}$$

If we set the run equal to one we get

$$\text{slope} = \frac{\text{rise}}{\text{run}} = 7.295 = \frac{\text{rise}}{1}$$

or 7.295 = rise.

What we get out of this is that if the height increases by one inch (run = 1) then we expect the weight to increase by 7.295 pounds. We are not suggesting that the boot camp graduates will grow. What is happening is that if we pick a graduate who is one inch taller than another graduate, we expect them to weigh 7.295 pounds more than the other graduate.

Example 13.1.3.

A backyard farmer loves to grow tomatoes. The farmer is trying to determine the relationship between yield and the amount of water the plants get. The farmer turns on the drip water system every day and times how long the water is on each plant. The data are in the following table.


Water times, minutes    4.8   5.2   6.1   6.8   7.5   7.3   7.1
Yield, pounds          50.2  51.8  56.4  66.5  60.1  73.2  66.1

Notice the water units are a bit odd. The farmer doesn’t have an easy way to determine the volume of water, which is what you really want, but the farmer can easily determine how long the water runs.

1. Is this an experiment or observational study?

2. Do you expect a linear relationship between the two variables?

3. Construct a scatter plot for the data.

4. Do the data follow a linear relationship?

5. Find the equation of the regression line.

6. Interpret a and b

7. Predict the yield if the water is run for 6.5 minutes each day.

Solution.

1. Since our farmer is changing the amount of water and observing the yield, it would be considered an experiment. It could be better. To be better, the farmer should have a better way to measure the water.

2. A linear relationship would be unrealistic for all values of x. If you overwater, or underwater, the plants, they will die and produce no tomatoes. But if we have a fairly small range of values, we expect the relationship to be close enough to linear.

3. [Figure: scatter plot “Tomato Yield and Amount of Water”; x-axis: Water, minutes; y-axis: Yield, pounds.]

4. Yes, although the data don’t fall on a line, they follow a line fairly closely.

5. By running the regression analysis on our calculator as in the last example we get

y = 17.58 + 6.725x

6. The value of a, which is 17.58, represents the yield of tomatoes, in pounds, if no water is applied. (If you live in a hot region you will say this doesn’t make sense: the plant will die. In cooler climates, or where there is enough rainfall that there is ample moisture in the soil, the plants will grow without any application of water by the farmer.)

The value of b is the slope, which tells us that for each additional minute the water is run, we can expect to get an additional 6.725 pounds of tomatoes.

7. We expect

y = 17.58 + 6.725 × 6.5 = 61.3

So if a plant is watered for 6.5 minutes each day we expect a yield of 61.3 pounds of tomatoes.

Finally, for our descriptive statistics portion of regression we want a measure of how well the data fit the line. In the last example we noticed that more water meant more tomatoes over the range of values observed. If there were no relationship between the amount of water and the yield, our estimate for the yield would be the same regardless of how much water was applied. A big if, granted. If we look at the yields, it is easy to calculate the average yield for the plants. It is 60.6 pounds. So if you asked for the expected yield when water is applied for 5.4 minutes, the answer would be 60.6 pounds. 7.4 minutes? 60.6 pounds. 4.9? 60.6 pounds!

Consider the following table. (Values rounded; ŷ is the value predicted by the model and ȳ is the mean yield.)

x      y      ŷ      e = y − ŷ   e² = (y − ŷ)²   e = y − ȳ   e² = (y − ȳ)²
4.8    50.2   49.9     0.35         0.12          -10.41        108.45
5.2    51.8   52.5    -0.74         0.55           -8.81         77.69
6.1    56.4   58.6    -2.20         4.83           -4.21         17.76
6.8    66.5   63.3     3.20        10.21            5.89         34.64
7.5    60.1   68.0    -7.91        62.59           -0.51          0.26
7.3    73.2   66.7     6.53        42.66           12.59        158.40
7.1    66.1   65.3     0.78         0.61            5.49         30.09
Total                             121.60                        427.31

Consider the first row. We had a plant receive 4.8 minutes of watering. The yield was 50.2 pounds. The model y = 17.58 + 6.725x predicted 49.9 pounds of tomatoes. We had 0.35 pounds more than the model predicted (we have rounded the values in the table). This is what we get in the fourth column. If we don’t consider the amount of water applied, then our prediction is 60.6 pounds. We had 10.41 pounds less than this prediction. This is in the sixth column, labeled e = y − ȳ.

If we add up the e² values using the model and not using the model we obtain

Σe² = 121.60 (using the regression model)

Σe² = 427.31 (not using the regression model)

Think of the 427.31 as the total error. After we use the model, the error remaining is 121.60. If we subtract them, we see that we removed 305.71 of the error. So what percent of the error did we remove?

305.71 / 427.31 = 0.7154

So we removed 71.54% of the error by applying the model. We would like to remove 100%, but that would imply an exact relationship between the two variables. If we wanted to remove more of the error and get a better prediction, we could look at additional independent variables: amount of sunlight the plant gets, position in the garden, etc. This would lead us to multiple linear regression or even multiple non-linear regression. Subjects for a different textbook.
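The two error totals and the proportion removed can be reproduced with a short Python sketch. The variable names are chosen here for illustration, and the fit uses the standard closed-form least-squares formulas rather than the calculator.

```python
# Proportion of error removed by the regression model (tomato data).
water = [4.8, 5.2, 6.1, 6.8, 7.5, 7.3, 7.1]           # x, minutes
yields = [50.2, 51.8, 56.4, 66.5, 60.1, 73.2, 66.1]   # y, pounds

n = len(water)
x_bar = sum(water) / n
y_bar = sum(yields) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(water, yields))
     / sum((x - x_bar) ** 2 for x in water))
a = y_bar - b * x_bar

# Sum of squared errors when predicting with the model (a + b*x) ...
error_with_model = sum((y - (a + b * x)) ** 2 for x, y in zip(water, yields))
# ... and when ignoring x and always predicting the mean yield.
error_without_model = sum((y - y_bar) ** 2 for y in yields)

proportion_removed = (error_without_model - error_with_model) / error_without_model

print(f"sum of e^2 with the model:    {error_with_model:.2f}")
print(f"sum of e^2 without the model: {error_without_model:.2f}")
print(f"proportion of error removed:  {proportion_removed:.4f}")
```

Working from unrounded residuals avoids the small discrepancies that appear when the rounded table entries are added by hand.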

The Coefficient of Determination of a bivariate sample data set, denoted r², is the proportion of the total error in predicting the dependent variable that is removed by using the regression model with the independent variable. The population Coefficient of Determination is denoted ρ².

The Correlation Coefficient of a bivariate sample data set is given by r and has the same sign as b. The population correlation coefficient is denoted ρ and has the same sign as B.

One thing we need to be concerned about is causation versus correlation. We can have, and often do have, variables that are correlated but where one does not cause the other. The following example illustrates this.

Example 13.1.4.

The divorce rate in Maine, per 1000 marriages, and the per capita consumption of margarine, in pounds, for several years are given in the table below.

Margarine      8.2  7.0  6.5  5.3  5.2  4.0  4.6  4.5  4.2  3.7
Divorce rate   5.0  4.7  4.6  4.4  4.3  4.1  4.2  4.2  4.2  4.1

1. Do you expect a linear relationship between the two variables?

2. Construct a scatter plot for the data with per capita margarine consumption as the independent variable, x.

3. Do the data follow a linear relationship?

4. Find the equation of the regression line.

5. Find the correlation coefficient.

6. Interpret a and b.

7. If the per capita margarine consumption is 5.5 pounds per person, estimate the divorce rate in Maine.

8. Comments

Solution.

1. We see no reason why there should be a relationship between the variables, so no.

2. [Figure: scatter plot “Divorce Rate in Maine and Margarine Consumption”; x-axis: Margarine Consumption per capita, pounds; y-axis: Divorce rate, per 1000.]

3. Looking at the data set we see a very strong positive linear relation.

4. From our calculator we get y = 3.309 + 0.201x

5. From our calculator we get r = 0.993

6. a = 3.309, which means that if there were no margarine consumed, then the divorce rate would be expected to be 3.3 per 1000. b = 0.201, which tells us that for each additional pound of margarine consumed per capita, the divorce rate is expected to increase by 0.2 per 1000.

7. y = 3.309 + 0.201 × 5.5 = 4.4145, or we expect the divorce rate to be 4.4 per 1000 marriages.

8. Comments: our interpretation of a has little or no reliability. The value x = 0 is very far from the rest of the data set. From the problem, we see we have a very strong relationship between the two variables. If we feel the divorce rate in Maine is too high, should we outlaw margarine? Of course not. These data came from tylervigen.com/spurious-correlations. The website is plugging a book about spurious correlations. If you look at a lot of different data sets, you will eventually find a pair that has a very high correlation coefficient, just by chance.
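The correlation coefficient reported by the calculator can be checked with a small Python sketch. It assumes the usual Pearson formula r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²), which this chapter does not state; the function name `correlation` is hypothetical.

```python
import math

# Margarine consumption (pounds per capita) and Maine divorce rate (per 1000).
margarine = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]
divorce = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]


def correlation(xs, ys):
    """Pearson sample correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(margarine, divorce)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

A value of r this close to 1 says only that the points lie near a line. As the comments above stress, it says nothing about whether margarine has anything to do with divorce.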

13.1.4 Exercises

1. In Greenville, the city is investigating trash bin and recycle bin use by the residents. A sample of several homes is taken and the amounts of trash and recycling are noted each week. The amounts of trash and recycling are given.

Recycling, gallons   43  38  53  61  44  50  49  51
Trash, gallons       32  42  18  13  45  32  40  19

(a) Construct a scatter plot with amount of recycling as the independent variable. Do the data exhibit a linear relationship?

(b) Find the equation of the regression line.

(c) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(d) A resident is selected and it is determined that they recycle 45 gallons per week. What is the expected amount of trash they generate?

(e) Find and interpret the correlation coefficient and the coefficient of determination.

2. At a party where alcohol is consumed the guests have a breathalyzer and are having fun with it. One partygoer arrives at the party, drinks several shots in succession, immediately starts a stopwatch, and has no more alcohol. At several times the partygoer blows into the breathalyzer and records the BAC (blood alcohol content) along with how long it has been since they took the drinks.

Time, minutes   24     35     61     77     95     123    152
BAC             0.113  0.105  0.096  0.093  0.086  0.078  0.071

(a) Construct a scatter plot with time as the independent variable. Do the data exhibit a linear relationship?



(d) What is the expected BAC after 100 minutes?

(e) How long after the drinks were taken will the BAC be 0.04? Comment on this result.

(f) Find and interpret the correlation coefficient and the coefficient of determination.

3. A reservoir is fed by several rivers and streams. A hydrologist is measuring the total annual rainfall at a location and the water in the reservoir on May 1st for several years.

Rainfall, inches           26.5  28.9  35.7  44.6  29.7
Water Storage, acre-feet   7700  7890  8070  8240  7630

(a) Construct a scatter plot with rainfall as the independent variable. Do the data exhibit a linear relationship?



(d) What is the expected water in the reservoir for an annual rainfall of 40.0 inches?

(e) If the reservoir has 8000 acre-feet, what is your estimate of the rainfall that year?


4. In Bakersfield, a city in the southern California Central Valley, it gets hot in the summer. Very hot. An energy consumer is comparing their energy use with the high temperature for the day. The high temperatures and energy usage for several days in the summer are given.

Temperature, F       95    105   106   98    98
Energy Usage, kWh    19.3  22.5  32    19.9  22.1

Temperature, F       99    101   94    105   100
Energy Usage, kWh    21.3  16.0  28.5  29.5  23.5

(a) Construct a scatter plot for the data given, with Temperature as the independent variable. Do the data exhibit a linear relationship?



(d) Predict the energy usage for the household if the high temperature outside is 99 degrees.

(e) What value does the model predict for a high temperature of 78? Comment on the result.

(f) Find the correlation coefficient and the coefficient of determination. What do these tell us?

5. The largest part of the cost of a gallon of gasoline is the cost of crude oil.

Cost of crude, $/barrel      29    41    65    101   95    42
Cost of gasoline, $/gallon   1.73  1.95  3.21  3.28  3.25  2.24

(a) Use the price of crude as the independent variable and construct the scatter plot. Do the data exhibit a linear relationship?



(d) An investor expects the price of crude to be $98 per barrel. What is the expected price of gasoline?

(e) What if the price of crude jumps to $200 a barrel? What about the price of gasoline then? Comment on your answer.


6. The lengths and weights of several newborn babies born at Memorial Hospital are observed. The results are given below.

Length, cm       56   55   55   56   54   57
Weight, ounces   123  110  96   132  105  108

Length, cm       54   59   59   54   56
Weight, ounces   103  140  137  121  119

(a) Construct a scatter plot with length as the independent variable. Do the data exhibit a linear relationship?



(d) The records for a newborn state the length as 58 cm but the weight is missing. What is your best guess for the weight?

(e) A newborn is finally asleep. You weigh the baby and find the weight is 120 ounces. You know that to measure the length would surely wake up the baby. What is your best guess for the length?


7. The volunteers of the Pelagic Shark Research Foundation collected data on bat rays in the Elkhorn Slough. The total length and disk width of several specimens were measured. The data follow.

Total Length, cm   28    38    34.5  25.0  33.0  29.5  30.0  34.0
Disk Width, cm     40.0  50.5  47.0  29.5  44.0  41.0  42.0  47.0

(a) Explain why you expect b to be positive.

(b) Construct a scatter plot with total length as the independent variable. Do the data exhibit a linear relationship?

(c) Find the equation of the regression line.

(d) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(e) You have agreed to assist in the measurement of the bat rays. You capture a bat ray and find the total length to be 31.0 cm. What do you expect the disk width to be?

(f) You have captured a bat ray. It is flopping around on the beach and leaves an imprint of the disk on the beach before escaping back into the water. You measure the disk width from the imprint in the sand and find it to be 43.5 cm. Estimate the total length.

(g) Find and interpret the correlation coefficient and the coefficient of determination.

8. The Sacramento and San Joaquin drainage areas are part of the water storage areas of the California Department of Water Resources. The storage of the areas is recorded on June 30th for the years 2013-2019. The water storage, in 1000’s of acre-feet, is given below.

Year          2013     2014    2015    2016     2017     2018     2019
Sacramento    11348.1  8273.9  8268.3  13026.6  13930    12596.7  15204.9
San Joaquin   6524.4   4948.9  4084.4  6330.8   10570.2  9279.9   10302.2

(a) Explain why you expect b to be positive.

(b) Construct a scatter plot with Sacramento as the independent variable. Do the data exhibit a linear relationship?

(c) Find the equation of the regression line.

(d) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(e) You have found a value in the past that indicates the storage in the Sacramento area was 11,000,000 acre-feet. Use the equation to predict the water storage in the San Joaquin area.

9. Below are four different data sets. For each data set, find the equation of the regression line and the correlation coefficient, and construct the scatter plot. Comment on your findings.

Data Set I

x   10    8     13    9     11    14    6     4     12     7     5
y   8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26  10.84  4.82  5.68

Data Set II

x   10    8     13    9     11    14   6     4    12    7     5
y   9.14  8.14  8.74  8.77  9.26  8.1  6.13  3.1  9.13  7.26  4.74

Data Set III

x   10    8     13     9     11    14    6     4     12    7     5
y   7.46  6.77  12.74  7.11  7.81  8.84  6.08  5.39  8.15  6.42  5.73

Data Set IV

x   8     8     8     8     8     8     8     19    8     8     8
y   6.58  5.76  7.71  8.84  8.47  7.04  5.25  12.5  5.56  7.91  6.89


13.2 Hypothesis Tests and Confidence Intervals for B

In the first part of our regression discussion, we focused solely on the descriptive portion of regression. We now turn our attention to the inferential statistics portion of regression. Specifically, we will discuss hypothesis tests and intervals.

When we began our journey through descriptive statistics we started with point estimates. We didn’t refer to them as such until much later, but that is what we did. We then constructed confidence intervals and performed hypothesis tests on the associated parameters. We do the same in this chapter. We have calculated r, ŷ, and b. We will now perform inferences on their population counterparts: ρ, y, and B, respectively.

As with the previous part of the chapter, we will limit the use of formulas and allow our calculators to do the work.

13.2.1 Hypothesis Tests of H0 : B = 0 and H0 : ρ = 0

Recall that b and r (and also B and ρ) have the same sign. For hypothesis tests where the hypothesized value is 0 we can do the tests together.

We now need to bring back our initial assumptions about the population:

For Linear Regression Inferences we need

y = A + Bx + ε

where

ε ∼ N(0, σε)

We will use the t-distribution with

df = n− 2

Notice the degrees of freedom are different for these tests than for the tests of µ. In tests of µ we had a one dimensional problem, so df = n − 1. In the linear regression model, we are looking at a two dimensional problem. This is why df = n − 2.

Let’s jump right in with an example:

Example 13.2.1.

A random sample of 8 recently recruited soldiers is selected after finishing boot camp. Their heights and weights are in the table in Example 13.1.1. Test whether the two variables have a positive correlation. Use a 5% level of significance.


Solution.


We have already observed that the assumptions are reasonable. Note that the problem is asking if the variables have a ‘positive correlation’. This is the same as saying ρ > 0. Note that we expect this to be true: as height increases, the weight increases as well.

We have the following: H1 : ρ > 0 and α = .05.

1. H0 : ρ = 0, H1 : ρ > 0 (or, equivalently, H0 : B = 0, H1 : B > 0)

2. Use t because y = A + Bx + ε where ε ∼ N(0, σε) and σε is unknown.

3. [Figure: t-distribution with a right-tail rejection region of area .05; the critical value is t = 1.943.]

4. We will use our calculator here (see below). We get t = 5.746 with a p-value of 0.0006.

5. Since the test statistic is in the rejection region (or equivalently the p-value is less than α), we conclude that a person’s height and weight are positively correlated.
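The test statistic in step 4 can also be reproduced without the calculator. The sketch below assumes the standard formulas s_e = √(Σe² / (n − 2)) for the standard deviation of the residuals and s_b = s_e / √(Σ(x − x̄)²) for the standard error of the slope; neither formula is given in this chapter, so treat them as assumptions of the example. The critical value is the one from step 3.

```python
import math

heights = [60, 68, 73, 66, 64, 69, 71, 63]          # x, inches
weights = [105, 137, 195, 159, 134, 184, 201, 134]  # y, pounds

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))

# Least-squares slope and intercept.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Standard deviation of the residuals, using df = n - 2.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(heights, weights))
s_e = math.sqrt(sse / (n - 2))

# Standard error of the slope and the test statistic for H0: B = 0.
s_b = s_e / math.sqrt(s_xx)
t_stat = b / s_b

T_CRITICAL = 1.943  # t-table value: right tail of .05 with df = 6 (from step 3)
print(f"t = {t_stat:.3f}")
print("reject H0" if t_stat > T_CRITICAL else "do not reject H0")
```

The p-value is left out on purpose: computing it needs the t-distribution's cumulative distribution function, which the Python standard library does not provide.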

Hypothesis test of B and ρ on the Calculator

• Enter the data in a list

• Select STAT>TESTS>LinRegTTest

• Specify the lists for x and y.

• Leave Freq at 1

• Select the alternative hypothesis (only necessary if using the p-value approach)

• Highlight Calculate and hit ENTER.

Example 13.2.2.

Recall our backyard farmer who loves to grow tomatoes. Determine, using a 1% significance level, if increasing watering time increases tomato yield, on average.



Solution.

If we examine the scatter plot, the assumptions aren’t as obviously met as they were for the height and weight data. We will, however, proceed.

We have the following: H1 : B > 0 and α = .01.

1. H0 : B = 0, H1 : B > 0

2. Use t because y = A + Bx + ε where ε ∼ N(0, σε) and σε is unknown.

3. [Figure: t-distribution with a right-tail rejection region of area .01; the critical value is t = 3.365.]

4. On our calculator we get t = 3.546 with a p-value of 0.0082.

5. Since the test statistic is in the rejection region (or equivalently the p-value is less than α), we conclude that as the time the plants are watered increases, the yield of tomatoes increases.

A comment: since the assumptions required to do the inference here were not clearly met, the results here are not so black and white. More data are required from the farmer. Also, if the conclusion is correct, then it is only valid in the range of values for watering times that are given.

13.2.2 Hypothesis Tests of B when B0 ≠ 0

In the hypothesis test of the heights and weights, it was obvious going into the problem that they should be positively correlated. (Of course, as good statistics students, we need proof.) What if we have an idea about what B might be? When you did your LinRegTTest, there was no place to enter a value of B. We will need to tweak the results of the test to get an appropriate test statistic. We now need to look at the test statistic we used before.

$$t = \frac{b - B_0}{s_b}$$

We didn’t give the formula before now. This is what the calculator uses to do the test. We need to change the value of B0. The value the calculator uses is 0, so what the calculator calculates is t = b/s_b. We can solve for s_b and we get s_b = b/t. We can plug this into the formula above to get our test statistic. To avoid confusion, let t_c be the value the calculator gives for the test statistic. So we get

$$t = \frac{b - B_0}{b/t_c} \quad\text{or}\quad t = t_c\left(1 - \frac{B_0}{b}\right)$$

where t is the test statistic with null hypothesis H0 : B = B0 and t_c is the test statistic from the LinRegTTest.

Finding the Test Statistic when B0 ≠ 0

Use the formula

$$t = t_c\left(1 - \frac{B_0}{b}\right)$$

where t is the test statistic with null hypothesis H0 : B = B0 and t_c is the test statistic from the LinRegTTest.

Example 13.2.3.

We have heard that for each additional inch of height, the weight should increase by 5 pounds. We feel that this is not correct for all of our recent recruits. Use a 5% level of significance to test if the expected increase in weight for an increase of 1 inch in height is different from 5 pounds.

Solution.

The description is exactly how we would describe b (or B). Since we want to make a statement about all recruits, we are actually describing B. We have the following: H1 : B ≠ 5 and α = .05.

1. H0 : B = 5, H1 : B ≠ 5

2. Use t because y = A + Bx + ε where ε ∼ N(0, σε) and σε is unknown.

3. [Figure: t-distribution with a two-tailed rejection region, area .025 in each tail; with df = 6 the critical values are t = −2.447 and t = 2.447.]

4. From the calculator we get t_c = 5.746 . . . and b = 7.294 . . .

So we get t = t_c(1 − B0/b) = 5.746(1 − 5/7.294) = 1.808

5. Since the test statistic is not in the rejection region, there is not sufficient evidence that for each additional inch of height, the weight increases, on average, by something other than 5 pounds.
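The shortcut formula can be sanity-checked against the defining formula in a few lines of Python. This is a sketch with hypothetical names (`shifted_t`, `T_CRITICAL`); it takes t_c and b from the calculator output above and assumes the two-tailed t-table value 2.447 for α = .05 with df = 6.

```python
def shifted_t(t_c, b, b0):
    """Test statistic for H0: B = b0, given the LinRegTTest statistic t_c (for H0: B = 0)."""
    return t_c * (1 - b0 / b)


t_c = 5.746        # LinRegTTest output for the boot camp data
b = 7.294676806    # slope from Example 13.1.2
b0 = 5             # hypothesized slope, pounds per inch

t_stat = shifted_t(t_c, b, b0)

# Cross-check with the defining formula t = (b - B0) / s_b, where s_b = b / t_c.
s_b = b / t_c
t_direct = (b - b0) / s_b

T_CRITICAL = 2.447  # assumed t-table value: two-tailed, alpha = .05, df = 6
print(f"t = {t_stat:.3f} (direct formula: {t_direct:.3f})")
print("reject H0" if abs(t_stat) > T_CRITICAL else "do not reject H0")
```

Both routes give the same number because the shortcut is just an algebraic rearrangement; its only advantage is that it needs nothing beyond what LinRegTTest already displays.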

13.2.3 Confidence Intervals of B

Some calculators will calculate these directly, others will not. We will proceed in this section as if you have the latter. The confidence interval formula for B should look reasonable. It is similar in structure to most of our confidence intervals.

$$b \pm t_{\alpha/2}\, s_b$$

Using s_b = b/t_c from before we get

$$b \pm t_{\alpha/2}\, \frac{b}{t_c}$$

A (1 − α) × 100% Confidence Interval for B is given by

$$b\left(1 \pm \frac{t_{\alpha/2}}{t_c}\right)$$

where t_c is the test statistic from the LinRegTTest.

Finding a Confidence Interval for B

Use the formula

$$b\left(1 \pm \frac{t_{\alpha/2}}{t_c}\right)$$

where t_{α/2} is from the t-table and t_c is the test statistic from the LinRegTTest.

Example 13.2.4.

Construct a 95% confidence interval for the increase in weight for each additional inch of height.


Solution.


With 6 degrees of freedom we get t.025 = 2.447. After running the LinRegTTest we get t_c = 5.746 . . . and b = 7.294 . . .

Our confidence interval of B is given by

7.294(1 ± 2.447/5.746)

or

4.188 to 10.401

Finally, we are 95% confident that for each additional inch of height among the recruits, weight goes up by 4.2 to 10.4 pounds, on average.
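The interval arithmetic is easy to script. The sketch below uses a hypothetical helper name, `slope_confidence_interval`, and takes b and t_c from the calculator output and the critical value from the t-table, exactly as in the example.

```python
def slope_confidence_interval(b, t_c, t_crit):
    """Confidence interval for B computed as b * (1 +/- t_crit / t_c)."""
    margin = b * t_crit / t_c  # same as t_crit * s_b, since s_b = b / t_c
    return b - margin, b + margin


lower, upper = slope_confidence_interval(
    b=7.294676806,  # slope from the calculator
    t_c=5.746,      # LinRegTTest test statistic
    t_crit=2.447,   # t.025 with 6 degrees of freedom, from the t-table
)
print(f"{lower:.3f} to {upper:.3f} pounds per inch")
```

Note that the interval excludes 0, consistent with rejecting H0 : B = 0 earlier, and includes 5, consistent with failing to reject H0 : B = 5.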

13.2.4 Exercises



(a) Using a 5% level of significance, test if the amount recycled and amount of trash are negatively correlated.

(b) Construct a 95% confidence interval for the rate of change of the amount of trash with respect to amount recycled.


Time, minutes   24     35     61     77     95     123    152
BAC             0.113  0.105  0.096  0.093  0.086  0.078  0.071

(a) A report indicates that the BAC reduces by .0002 per hour. Test, using a 5% level of significance, if the report is incorrect.

(b) Construct a 95% confidence interval for the slope of the regression line.





(a) A neighbor of the hydrologist claims that ‘For each additional inch of rain, the reservoir increases water storage by 25 acre-feet.’ Test if the neighbor’s statement is false.





(a) The consumer has read that for each additional degree in the high temperature, the energy usage goes up by 1 kWh. Test at the 5% level of significance if the statement is false.

(b) Construct a confidence interval for the true slope of the regression line.



(a) Test if the price of crude and the price of gasoline are positively correlated. Use a 5% level of significance.





(a) Test at the 5% level of significance if the length and weight of babies are positively correlated.




(a) Test, at the 1% level of significance, if the slope of the regression line is different from 1.




Year          2013     2014    2015    2016     2017     2018     2019
Sacramento    11348.1  8273.9  8268.3  13026.6  13930    12596.7  15204.9
San Joaquin   6524.4   4948.9  4084.4  6330.8   10570.2  9279.9   10302.2

(a) Test, at the 1% level of significance, if the slope of the regression line is less than 1.

9. Below are four different data sets from the last section. For which data set(s) would it be appropriate to use the methods of this section? Explain.

Data Set I

x   10    8     13    9     11    14    6     4     12     7     5
y   8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26  10.84  4.82  5.68

Data Set II

x   10    8     13    9     11    14   6     4    12    7     5
y   9.14  8.14  8.74  8.77  9.26  8.1  6.13  3.1  9.13  7.26  4.74

Data Set III

x   10    8     13     9     11    14    6     4     12    7     5
y   7.46  6.77  12.74  7.11  7.81  8.84  6.08  5.39  8.15  6.42  5.73

Data Set IV

x   8     8     8     8     8     8     8     19    8     8     8
y   6.58  5.76  7.71  8.84  8.47  7.04  5.25  12.5  5.56  7.91  6.89


13.3 Prediction Intervals and Confidence Intervals for µy|x

When we started our regression discussion one of the things we did was predict values of y for a given value of x. In the tomato example, we found that when we watered a plant for 6.5 minutes, we expected a yield of 61.3 pounds. So was that an estimate of the yield of a particular plant if we watered it for 6.5 minutes, or an estimate of the average yield if water is applied for 6.5 minutes? Answer: both. What we would like to do now is create intervals for both of these. These intervals are called prediction intervals and confidence intervals of µy|x, respectively. We will look at the formulas for both.

A (1 − α)100% Prediction Interval for a given x is given by

$$(a + bx) \pm t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1)s_x^2}}$$

A (1 − α)100% Confidence Interval for µy|x is given by

$$(a + bx) \pm t_{\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1)s_x^2}}$$

Notice that the formulas are almost the same. The only difference is the 1 under the radical of the prediction interval. The formulas share the following: a, b, x, and n. They also share s_e, the sample standard deviation of the errors, and x̄ and s_x, the mean and standard deviation of the independent variable, x. Notice the numerator of the last term under the radical: (x − x̄)². For values of x near the center of the data set, this is small. As the value of x moves further out from the mean, the entire radicand gets bigger and hence we get a wider interval. This is what we expect: for values of x that are far away from the data set, we have little confidence in the estimate. (If we poured a lot of water on the tomato plants, we wouldn’t expect the predicted value from the model to be very reliable.)
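The two formulas can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the calculator program used in this section: the function names `prediction_interval` and `mean_response_interval` are hypothetical, the summary statistics in the usage lines are made up for illustration, and the critical value t is assumed to be looked up in a t-table by the caller.

```python
import math

def _margin(t, se, n, x, x_bar, s_x, extra):
    """Margin of error; extra is 1 for a prediction interval, 0 for a CI of mu_y|x."""
    radicand = extra + 1 / n + (x - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    return t * se * math.sqrt(radicand)

def prediction_interval(a, b, t, se, n, x, x_bar, s_x):
    """Interval for the y value of one individual at the given x."""
    point = a + b * x
    e = _margin(t, se, n, x, x_bar, s_x, extra=1)
    return point - e, point + e

def mean_response_interval(a, b, t, se, n, x, x_bar, s_x):
    """Interval for the mean of y at the given x."""
    point = a + b * x
    e = _margin(t, se, n, x, x_bar, s_x, extra=0)
    return point - e, point + e

# Hypothetical summary statistics, for illustration only (not from the text).
stats = dict(a=10.0, b=2.0, t=2.571, se=3.0, n=7, x_bar=5.0, s_x=2.0)
pi_low, pi_high = prediction_interval(x=6.5, **stats)
ci_low, ci_high = mean_response_interval(x=6.5, **stats)
print(pi_low, pi_high)
print(ci_low, ci_high)
```

Both intervals are centered at the same point estimate a + bx; the only difference is the `extra` term under the square root, which makes the prediction interval the wider of the two.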

Let’s let n, the number of pairs of data values, get large. Very large. In fact, so large that the statistics are essentially equal to the corresponding parameters. We then get

(A + Bx) ± z_{α/2} σ_e and (A + Bx), respectively.

The prediction interval has been reduced to a simple interval, and the confidence interval has been reduced to a single number. The latter is what we expect to see. This was what we observed when we first started with confidence intervals.
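The limiting step can be written out term by term. This is a sketch under the assumption stated in the text that, for very large n, the statistics a, b, s_e, and s_x settle at their parameters and t_{α/2} approaches z_{α/2}; the symbol σ_x for the population standard deviation of x is notation introduced here, not the text's.

```latex
\frac{1}{n} \to 0
\qquad\text{and}\qquad
\frac{(x - \bar{x})^2}{(n - 1)s_x^2} \approx \frac{(x - \mu_x)^2}{(n - 1)\sigma_x^2} \to 0
\qquad (n \to \infty)

\text{Prediction interval: }\quad
\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1)s_x^2}} \to \sqrt{1} = 1
\quad\Longrightarrow\quad (A + Bx) \pm z_{\alpha/2}\,\sigma_e

\text{Confidence interval for } \mu_{y|x}\text{: }\quad
\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1)s_x^2}} \to 0
\quad\Longrightarrow\quad (A + Bx) \pm 0
```

The 1 under the radical is the only term that does not shrink with n, which is why the prediction interval keeps a positive width while the confidence interval collapses to a point.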

As we can see, the formula is rather large for each case. As such, we will enter a simple program on our calculator (TI-83 or 84) for each. We will call the prediction interval program PI, and the program for the confidence interval for µy|x will be called CI.

• PRGM>NEW>Create New (hit the number 1 or [ENTER])

• Type in PI (You should be in ALPHA Lock mode.) [ENTER]

• PRGM>I/O>Disp (hit the number 3, or highlight and hit [ENTER])

• 2nd>A-LOCK>“ENTER X” (The space is located with the 0 key, the quotes above the + key) [ENTER]

• PRGM>I/O>Input (hit the number 1, or highlight and hit [ENTER])

• X [ENTER]

• PRGM>I/O>Disp

• 2nd>A-LOCK>“ENTER T” [ENTER]

• PRGM>I/O>Input

• T [ENTER]

• a+b*X → P (See below) [ENTER]

• T*s*√(1 + 1/n + (X − x̄)²/((n − 1)Sx²)) → E (See below) [ENTER]

• PRGM>I/O>Disp P-E, P+E [ENTER] (P for point estimate, E for margin of error)

• 2nd QUIT

Your program is now ready to use.

For the above you need the following:

For X and T, use the ALPHA button and then find the letter.
a is located at VARS>Statistics...>EQ>a (A is not the same as a here)
b is located at VARS>Statistics...>EQ>b
s is located at VARS>Statistics...>TEST>s (Scroll down to find)
n is located at VARS>Statistics...>XY>n
x̄ is located at VARS>Statistics...>XY>x̄
Sx is located at VARS>Statistics...>XY>Sx

→ is the STO→ key on the keyboard (STO→ is just above the ON button.)

Now to write the program for our confidence interval of µy|x. This assumes you have already entered the PI program.

• PRGM>NEW>Create New

• CI [ENTER]

• 2nd>RCL>PRGM>EXEC>PI [ENTER]

This has copied the PI program into CI. Use your arrow keys to find the ‘1+’. Highlight the 1 and hit your delete button twice.

• 2nd>QUIT

Your programs are now ready to use.

IN ORDER TO USE THE PROGRAM YOU NEED TO RUN THE LinRegTTest FIRST!

You can ignore the output. Let us proceed with an example which utilizes our programs.


Using Programs PI and CI to find prediction intervals and confidence intervals of µy|x.

• Look up t_{α/2} in the t-table

• Enter data into lists

• Run LinRegTTest (STAT > TESTS > LinRegTTest)

• CLEAR (Not required)

• PRGM>EXEC>CI (for confidence interval of µy|x, PI for prediction interval)

• Enter values of X and t_{α/2} as prompted
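The same workflow can be sketched in Python to show why LinRegTTest has to run first: the programs only read stored statistics (a, b, s, n, x̄, Sx), so something must compute them from the lists. This is a hedged sketch, not the calculator's implementation; the names `lin_reg_stats` and `interval` are hypothetical, and the data are made up for illustration.

```python
import math

def lin_reg_stats(xs, ys):
    """Compute the quantities the PI and CI programs read: a, b, s, n, x_bar, s_x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)  # equals (n - 1) * s_x**2
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                            # slope
    a = y_bar - b * x_bar                    # intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))             # standard deviation of the errors
    s_x = math.sqrt(sxx / (n - 1))           # sample standard deviation of x
    return {"a": a, "b": b, "s": s, "n": n, "x_bar": x_bar, "s_x": s_x}

def interval(stats, x, t, prediction):
    """Mirror of the PI (prediction=True) and CI (prediction=False) programs."""
    p = stats["a"] + stats["b"] * x          # point estimate P
    radicand = 1 / stats["n"] + (x - stats["x_bar"]) ** 2 / (
        (stats["n"] - 1) * stats["s_x"] ** 2)
    if prediction:
        radicand += 1                        # the extra 1 in the PI formula
    e = t * stats["s"] * math.sqrt(radicand)  # margin of error E
    return p - e, p + e

# Hypothetical data, for illustration only (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
stats = lin_reg_stats(xs, ys)  # plays the role of running LinRegTTest
t = 3.182                      # t_{0.025} with n - 2 = 3 degrees of freedom, from a t-table
pi = interval(stats, x=3.5, t=t, prediction=True)
ci = interval(stats, x=3.5, t=t, prediction=False)
print(pi)
print(ci)
```

As on the calculator, the critical value is supplied by hand, so the confidence level depends entirely on looking up the right t for n − 2 degrees of freedom.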

Example 13.3.1.

Construct a 95% prediction interval for the yield and a 95% confidence interval for the mean yield when a tomato plant is watered for 6.5 minutes each day.

Solution.

To refresh our memories, here is the data


Notice in this case we are given the value of x as 6.5. We have 5 degrees of freedom. From our table we get t_{0.025} = 2.571.

Input the data in L1 (times) and L2 (yields) and run the LinRegTTest. (You don’t need to look at the output; you can simply hit CLEAR.)

Now hit PRGM>EXEC>PI (Hit [ENTER] or simply hit the number.)
[ENTER] (This starts the program)
It should be asking you for X; enter 6.5 [ENTER]
Now it should want T; enter the value we looked up, 2.571 [ENTER]

It should give you an interval of 47.723... to 74.849...

So, we are 95% confident that the yield of a randomly selected tomato plant which has 6.5 minutes of water applied each day will be between 47.7 and 74.8 pounds.

To get the confidence interval, repeat the above procedure but select CI instead of PI. The corresponding interval is 56.469... to 66.103...

So, we are 95% confident that the average yield for tomato plants that have 6.5 minutes of water applied each day is between 56.5 and 66.1 pounds.

Look at the two intervals. Notice that the prediction interval is wider. This is because the prediction interval includes the variability of the individual, whereas the confidence interval does not. Also note that the sentence for the second interval clearly addresses the mean, whereas the sentence for the first interval makes no mention of the mean.

Example 13.3.2.

Construct a 90% prediction interval for the weight and a 90% confidence interval for the mean weight of recruits that are 65 inches tall.

Solution.

Here is the data from before.


We have 6 degrees of freedom and t_{0.05} = 1.943.

Enter the data into your calculator and run LinRegTTest.

Next run the PI program. Enter 65 and 1.943 when prompted. We get 113.0 to 173.7 pounds. Do the same for the CI program and we get 132.5 to 154.3 pounds.

We are 90% confident that a randomly selected recruit that is 65 inches tall will weigh between 113.0 and 173.7 pounds.

We are 90% confident that the average weight of a recruit that is 65 inches tall is between 132.5 and 154.3 pounds.

Below is a graph of the scatterplot of the heights and weights of our recruits along with the regression line (the darker one in the middle), the prediction interval curves (top and bottom curves), and the confidence interval curves for µy|x (the remaining two curves). Also, there is a vertical line at x = 65, the value we looked at in the last example. The prediction interval can be found by looking at the y coordinates of the points where the line intersects the top and bottom curves. You should check that this is correct. Similarly, we can get the confidence interval for µy|x, and even the point estimate, by looking at the line in the middle.


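[Figure: scatterplot of weight (pounds, axis from 100 to 220) against height (inches, axis from 58 to 74) with the regression line, the prediction interval curves, the confidence interval curves for µy|x, and a vertical line at x = 65.]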
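The curves in such a graph can be traced by evaluating both interval formulas at a sequence of x values. The sketch below does this in Python; the function name `band` and all of the summary statistics are hypothetical stand-ins (they are not the recruit data), chosen only to show the shape of the curves.

```python
import math

def band(x, a, b, t, se, n, x_bar, s_x):
    """Return (pi_low, ci_low, fit, ci_high, pi_high) at the given x."""
    fit = a + b * x                                    # point on the regression line
    core = 1 / n + (x - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    ci = t * se * math.sqrt(core)                      # half-width of the CI for mu_y|x
    pi = t * se * math.sqrt(1 + core)                  # half-width of the prediction interval
    return fit - pi, fit - ci, fit, fit + ci, fit + pi

# Hypothetical summary statistics, for illustration only (not the recruit data).
stats = dict(a=-200.0, b=5.0, t=1.943, se=12.0, n=8, x_bar=66.0, s_x=4.0)
rows = [(x, band(x, **stats)) for x in range(58, 75, 2)]
for x, values in rows:
    print(x, [round(v, 1) for v in values])
```

Each printed row is one vertical slice of the graph: the outer two numbers lie on the prediction interval curves, the next two on the confidence interval curves, and the middle number on the regression line. Both pairs of curves are closest together at x̄ and spread apart as x moves away from it.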


13.3.1 Exercises



(a) Construct a 95% prediction interval for the amount of trash collected for a randomly selected household that recycles 45 gallons.

(b) Construct a 95% confidence interval for the average amount of trash collected for households that recycle 45 gallons.


Time, minutes  24     35     61     77     95     123    152
BAC            0.113  0.105  0.096  0.093  0.086  0.078  0.071

(a) Construct a 95% prediction interval for the BAC after 1 hour.

(b) Construct a 95% confidence interval for the average BAC after 1 hour.





(a) Construct a 95% prediction interval for the water storage in a year that saw 30 inches of rain.

(b) Construct a 95% confidence interval for the average water storage for years that see 30 inches of rain.




(a) If on a randomly selected day the high temperature is 101°F, make a 99% prediction interval for the energy usage for that day.

(b) Find a 90% confidence interval for the mean energy usage for days when the temperature is 101°F.



(a) You have just read that the cost of crude is expected to be $80 per barrel when you leave to go on vacation. Construct a 95% prediction interval for the cost of a gallon of gasoline.

(b) Construct the corresponding 95% confidence interval for the average price of gasoline when oil is $80 per barrel.




(a) If a baby that is 58 cm long is selected, what is the 90% prediction interval for the baby's weight?

(b) Of all babies that are 58 cm long, what is the 90% confidence interval for the mean weight?




(a) For a bat ray with a total length of 32.0 cm, find the 95% prediction interval for the disk width.

(b) Of all bat rays with a total length of 32.0 cm, find the 95% confidence interval for the mean disk width.


Year         2013     2014    2015    2016     2017     2018     2019
Sacramento   11348.1  8273.9  8268.3  13026.6  13930    12596.7  15204.9
San Joaquin  6524.4   4948.9  4084.4  6330.8   10570.2  9279.9   10302.2

(a) For a year in which the storage in the Sacramento drainage area is 10,000,000 acre-feet, find a 95% prediction interval for the storage in the San Joaquin drainage area.

(b) For all years in which the storage in the Sacramento drainage area is 10,000,000 acre-feet, find a 95% confidence interval for the mean storage in the San Joaquin drainage area.

