Unit 3: Examining Relationships - MATH WITH MAYERmathwithmayer.weebly.com/uploads/3/7/2/7/37277397/handout_ap_… · Fat gain (kg) 4.2 3.0 3.7 2.7 3.2 3.6 2.4 1.3 3.8 1.7 1.6 2.2

Unit 3: Examining Relationships

Lesson 3: Least-Squares Regression

2016-2017

Bivariate Relationships

So far, you’ve learned that when you explore/describe a bivariate (x,y) relationship, you must:

Determine the Explanatory and Response variables

Plot the data in a scatterplot

Note the Strength, Direction, and Form

Note the mean and standard deviation of x and the mean and standard deviation of y

Calculate and Interpret the Correlation, r

If a scatterplot shows a linear relationship, it makes sense to model that relationship with a line.

Least-squares regression is a method for finding a line that summarizes the relationship between two variables, but only in a specific setting: one variable (the explanatory variable) helps explain or predict the other (the response variable).

Finding the equation of that line allows us to make predictions.

Regression Line

• Unlike correlation, regression requires an explanatory

variable and response variable.

• FYI: least-squares regression line is often abbreviated

as LSRL.

Example

• Some people don’t gain weight even when they overeat.

One possible explanation is in “nonexercise activity” (NEA),

such as fidgeting: some people may increase NEA when fed

more.

• For 8 weeks, 16 healthy young adults were deliberately

overfed. Researchers measured fat gain (in kg) and the

change in energy use (in calories) from activity other than

deliberate exercise, such as in fidgeting, daily living, etc.

• Do people with larger increases in NEA tend to gain less fat?

NEA change (cal) -94 -57 -29 135 143 151 245 355 392 473 486 535 571 580 620 690

Fat gain (kg) 4.2 3.0 3.7 2.7 3.2 3.6 2.4 1.3 3.8 1.7 1.6 2.2 1.0 0.4 2.3 1.1

Let’s properly examine this relationship

Step 1: Determine the Explanatory and Response variables

Explanatory: Nonexercise activity (in calories)

Response: Fat gain (kg)

Step 2: Plot the data in a scatterplot

Use your calculator

Step 3: Note the Direction, Form, and strength

Negative, linear, pretty strong


Step 4: Note the mean and standard deviation of x and the mean and standard deviation of y

Use your 1-Var Stats

Explanatory variable: NEA Change (cal)

Mean: 324.8 calories

Standard Deviation: 257.66 calories

Response Variable: Fat gain (kg)

Mean: 2.388 kg

Standard Deviation: 1.1389 kg


Step 5: Calculate and Interpret the Correlation, r

Again, use your calculator:

r = -0.7786

This indicates a moderately strong, negative linear relationship: people with larger increases in NEA tend to gain less fat.
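For anyone without a calculator handy, the 1-Var Stats values and r from Steps 4 and 5 can be reproduced with a short script. This is only a sketch: the helper name `correlation` is made up for illustration, and it uses the usual sample formulas (dividing by n - 1), which is an assumption about what the calculator does.

```python
from statistics import mean, stdev

# Data from the NEA example (16 subjects).
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # fat gain (kg)

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    z_products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys))
    return sum(z_products) / (len(xs) - 1)

r = correlation(nea, fat)
print(f"x: mean = {mean(nea):.1f} cal, sd = {stdev(nea):.2f} cal")
print(f"y: mean = {mean(fat):.3f} kg, sd = {stdev(fat):.4f} kg")
print(f"r = {r:.4f}")
```

The printed values should agree with the slide values above, up to rounding.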

Step 6: So now what?

• Between the scatterplot and the value of r, it can be

concluded that a fairly strong linear relationship exists

between nonexercise activity and fat gain.

• To define the linear relationship between an explanatory

and response variable, a line needs to be drawn and its

equation found.

• We all could look at the same data and draw very

different lines to model it. There are multiple ways to

draw a regression line. So which one is the best line?

• Step 6: Calculate and Interpret the Least Squares

Regression Line in context.

Equation of LSRL

“y-hat” = a + bx

In the equation…

• “y hat” is the predicted value of the response variable y for a given value of the explanatory variable x

• b is the slope

• a is the y-intercept

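The least-squares regression line gives the best fit for a scatterplot in the sense that it makes the sum of the squared vertical distances from the points to the line as small as possible. This is justified geometrically (see pp. 145-149, my webpage for a video, or come ask me if you’re curious).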

• Understanding and using the line is more important than the

details of where the equation came from.

• What is important is to understand what the slope and y-intercept mean in the context of the data.

• Remember from Algebra 1, that slope is rate of change:

change in y (response variable) to change in x (explanatory

variable).

• Further, in statistics, along the regression line the slope

represents a change of one standard deviation in x that

corresponds to a change of r standard deviations in y.

• Also remember from Algebra 1 that the y-intercept is the value

for which the x is zero; in stats that means the y-intercept is

the predicted response variable value when the explanatory

variable is zero.
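The "best fit" claim can be checked numerically. The sketch below compares the sum of squared vertical distances for the least-squares line (using the rounded coefficients a = 3.505 and b = -0.00344 from this lesson) against two other lines. The two alternative lines are hypothetical, invented only for comparison; they are not from the lesson.

```python
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # fat gain (kg)

def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Least-squares line, with coefficients rounded as in the lesson.
sse_lsrl = sum_squared_errors(3.505, -0.00344, nea, fat)

# Two hypothetical "eyeballed" lines (assumed values, for comparison only).
sse_flatter = sum_squared_errors(3.2, -0.0025, nea, fat)
sse_steeper = sum_squared_errors(3.9, -0.0045, nea, fat)

print(f"LSRL:         {sse_lsrl:.3f}")
print(f"flatter line: {sse_flatter:.3f}")
print(f"steeper line: {sse_steeper:.3f}")
```

Any other line you try should also give a larger sum than the least-squares line.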

The statistical meaning of

slope and y-intercept

Example: Step 6: Calculate the equation of the least-squares regression line

• Format: “y-hat” = a + bx (according to the book). Truthfully, either a or b could stand for the y-intercept or the slope (so it could be written as “y-hat” = ax + b), but no matter which, the slope is ALWAYS the coefficient of x.

• Solution: You can find the values of a and b using your

calculator, under stat-calc.

• CAREFUL: 4:LinReg(ax+b) vs. 8:LinReg(a+bx). Again,

doesn’t matter which one you use, just make sure to

correctly identify the y-intercept and slope. I’ll be writing it

consistently with the book’s format.

“y-hat” = 3.505 – 0.00344x

What a and b mean in context

• In this example, the slope b = -0.00344 tells us that, according to the linear model, for each additional calorie of NEA, predicted fat gain goes down by 0.00344 kg. The slope of the regression line is the predicted RATE OF CHANGE in the response y as the explanatory variable x changes.

• The y-intercept a = 3.505 kg is the model’s estimate of how much fat is gained when there is no change in nonexercise activity (x = 0 calories).
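The calculator's LinReg result can be reproduced from two facts stated later in this lesson: the slope is r times Sy/Sx, and the line passes through (x-bar, y-bar). The function name `least_squares_line` below is a made-up helper, and this is a sketch rather than what the calculator literally does internally.

```python
from statistics import mean, stdev

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # fat gain (kg)

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation r from the sample formula (divide by n - 1).
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x          # slope: r times Sy/Sx
    a = y_bar - b * x_bar      # intercept: line passes through (x-bar, y-bar)
    return a, b

a, b = least_squares_line(nea, fat)
print(f"y-hat = {a:.3f} + ({b:.5f})x")
```

Rounded, the intercept and slope should match a = 3.505 and b = -0.00344.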

Graphing the line of regression by hand

By hand: The line always goes through the point (x-bar, y-bar) and the y-intercept (0, a). Since all you need to graph a line is two points, these are an easy two points to plot to make your line.

Plotting the line of regression on your TI after

your scatterplot is made and equation of

regression line is found

Using the model to

predict

• If a person’s NEA increases by 400 calories when

she overeats, how much fat gain is predicted?

Solution: use the model for predicting by substituting

the desired value into the equation. If x = 400, then

“y-hat” equals 2.13 kg.
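The substitution can be written as a tiny function. This sketch assumes the rounded coefficients a = 3.505 and b = -0.00344 quoted in the interpretation of the slope and y-intercept; the function name `predict_fat_gain` is made up for illustration.

```python
A = 3.505       # y-intercept in kg
B = -0.00344    # slope in kg per calorie of NEA change

def predict_fat_gain(nea_change):
    """Predicted fat gain (kg) for a given NEA change (cal): y-hat = a + b*x."""
    return A + B * nea_change

print(round(predict_fat_gain(400), 2))   # predicted fat gain for x = 400
```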

Example 2: Rats

Some data were collected on the weight of a male white laboratory rat for the first 25 weeks after its birth. A scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong, positive linear relationship. The linear regression equation “weight hat” = 100 + 40(time) models the data fairly well.

1. What is the slope of the regression line? Explain what it means in the context. The slope is 40. The rat’s weight is predicted to increase by 40 grams for each additional week.

2. What’s the y-intercept? Explain what it means in the context. The y-intercept is 100. The rat’s predicted weight at birth (time = 0) is 100 grams.

3. Predict the rat’s weight after 16 weeks. Show your work. “weight hat” = 100 + 40(16) = 740 grams.

4. Should you use this line to predict the rat’s weight at age 2 years? Use the equation to make the prediction and think about the reasonableness of your work. Two years is 104 weeks, so the predicted weight is 100 + 40(104) = 4260 grams, which is about 9.4 pounds. This is absurd. The data cover only the first 25 weeks, so the line should not be used to predict this far outside that range.
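The rat predictions can be sketched in code, along with a simple check on whether a prediction falls outside the observed 25 weeks. The function names and the grams-per-pound conversion constant are illustrative choices, not part of the original example.

```python
FIRST_WEEK, LAST_WEEK = 0, 25    # range of the observed data (weeks)
GRAMS_PER_POUND = 453.592        # conversion factor

def predict_weight(weeks):
    """Predicted weight in grams: weight-hat = 100 + 40(time)."""
    return 100 + 40 * weeks

def is_extrapolation(weeks):
    """True when the time lies outside the range of the data."""
    return not (FIRST_WEEK <= weeks <= LAST_WEEK)

for weeks in (16, 2 * 52):       # 16 weeks, and 2 years = 104 weeks
    grams = predict_weight(weeks)
    note = "extrapolation, do not trust" if is_extrapolation(weeks) else "within data range"
    print(f"{weeks:>3} weeks: {grams} g ({grams / GRAMS_PER_POUND:.1f} lb), {note}")
```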

In most cases, there is no line that will hit all the data points in a scatterplot. But the line that best fits the data makes the vertical distances of the actual data points from the line as small as possible; the least-squares line minimizes the sum of the squares of these distances.

There are many possible reasons that all the data values do not fall on a straight line in a scatterplot; most of these reasons differ from data point to data point and canNOT be individually identified or quantified.

However, we can quantify what percent of the variation in the response variable, y, is accounted for by the regression line. It’s called the coefficient of determination, or r-squared.

Assessing how well the data is fit by the line

R-squared: a way to assess the linear relationship

• The coefficient of determination, or r-squared, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

• R-squared measures how successful the regression was in explaining the response by expressing it as a percentage: about (r-squared × 100)% of the variation in y is accounted for by the linear relationship.

• The derivation of r-squared is somewhat time-consuming; just know what its value tells us about a linear relationship.

R-squared Example

• So in our example, the value of r was -0.7786.

• Our r-squared then would be (-0.7786)² = 0.6062.

• This means that 60.6% of the variation in y values is accounted for by the least-squares line. The other 39.4% is not explained.

• BE SPECIFIC. What context is being referred to with “y values”?

• Restated: 60.6% of the variation in fat gain is accounted for by the least-squares regression line of fat gain on NEA change.
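To see where "fraction of the variation explained" comes from, the sketch below computes r-squared as 1 minus (squared prediction errors around the line) divided by (squared deviations of y around its mean). That formula is the standard definition and is supplied here as background, since the lesson skips the derivation. The coefficients are the rounded values a = 3.505 and b = -0.00344.

```python
from statistics import mean

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # fat gain (kg)

A, B = 3.505, -0.00344    # rounded LSRL coefficients from the lesson

y_bar = mean(fat)
# Total variation in y: squared deviations from the mean of y.
total_variation = sum((y - y_bar) ** 2 for y in fat)
# Unexplained variation: squared deviations from the regression line.
unexplained = sum((y - (A + B * x)) ** 2 for x, y in zip(nea, fat))

r_squared = 1 - unexplained / total_variation
print(f"r-squared = {r_squared:.3f}")
print(f"r * r     = {(-0.7786) ** 2:.3f}")
```

Both printed values should agree to about three decimal places, which is why squaring r gives the same answer.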

Residuals: another way to assess the relationship

We can informally assess how well a line fits a scatterplot by examining the deviations of the plotted points from the line. Specifically, we look at the vertical distances: the smaller the vertical distances of the points above or below the line, the stronger the linear relationship.

Residuals defined

• A residual is the difference between an observed value of the response variable and the value predicted by the regression line.

• Residual = observed y – predicted y

• Residual = y – y-hat

Finding Residuals Example

x: NEA change (cal) -94 -57 -29 135 143 151 245 355 392 473 486 535 571 580 620 690

y: Fat gain (kg) 4.2 3.0 3.7 2.7 3.2 3.6 2.4 1.3 3.8 1.7 1.6 2.2 1.0 0.4 2.3 1.1

Using this data, we found that the equation used to predict fat gain was approximately y-hat = 3.505 – 0.00344x.

Use this equation to find the predicted fat gain for each of these x-values. Obviously, the predicted fat gain was going to be different from the observed fat gain. How different were the observed values from the predicted values?

x: NEA change (cal) -94 -57 -29 135 143 151 245 355 392 473 486 535 571 580 620 690

y: Fat gain (kg) 4.2 3.0 3.7 2.7 3.2 3.6 2.4 1.3 3.8 1.7 1.6 2.2 1.0 0.4 2.3 1.1

y-hat: predicted fat gain (kg) 3.83 3.70 3.60 3.04 3.01 2.99 2.66 2.28 2.16 1.88 1.83 1.66 1.54 1.51 1.37 1.13

Residual: y – y-hat (kg) 0.37 -0.70 0.10 -0.34 0.19 0.61 -0.26 -0.98 1.64 -0.18 -0.23 0.54 -0.54 -1.11 0.93 -0.03
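The predictions and residuals can also be generated with a short loop. This sketch assumes the rounded coefficients a = 3.505 and b = -0.00344; values worked by hand can differ in the second decimal place depending on how the coefficients were rounded.

```python
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # x: NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # y: fat gain (kg)

A, B = 3.505, -0.00344    # rounded LSRL coefficients from the lesson

predicted = [A + B * x for x in nea]                          # y-hat
residuals = [y - y_hat for y, y_hat in zip(fat, predicted)]   # y - y-hat

print(f"{'x':>5} {'y':>5} {'y-hat':>6} {'resid':>6}")
for x, y, y_hat, resid in zip(nea, fat, predicted, residuals):
    print(f"{x:>5} {y:>5.1f} {y_hat:>6.2f} {resid:>6.2f}")
```

A positive residual means the subject gained more fat than the line predicted; a negative residual means less.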

More about residuals

• The mean of the least-squares residuals is always zero, though roundoff error can make the sum of rounded residuals slightly different from zero.

• We can plot our residuals on a residual plot; the plot makes patterns easier to see because the residuals (on the vertical axis) are plotted against the explanatory variable (on the horizontal axis).

Residual Plot

• These vertical differences between

points and line, these residuals, are

considered errors in our predictions.

• The sum of the least-squares residuals is

always zero.

• Because the mean of the residuals is always zero, the horizontal line at zero in the figure helps orient us. This “residual = 0” line corresponds to the regression line.

• The closer the residuals are to the

horizontal line, the better the fit of the line

to the data
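The "sum is always zero" fact can be checked by fitting the line at full precision instead of using rounded coefficients. In this sketch the slope is computed directly from the deviations from the means, which is algebraically the same as r times Sy/Sx. The list `residual_points` holds the (x, residual) pairs that a residual plot would display; that name is made up for illustration.

```python
from statistics import mean

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # fat gain (kg)

x_bar, y_bar = mean(nea), mean(fat)

# Full-precision least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(nea, fat))
     / sum((x - x_bar) ** 2 for x in nea))
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(nea, fat)]
residual_points = list(zip(nea, residuals))   # what a residual plot shows

# Zero up to floating-point roundoff.
print(f"sum of residuals  = {sum(residuals):.2e}")
print(f"mean of residuals = {mean(residuals):.2e}")
```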

Making a residual plot on the calculator

1. X should be in L1, Y in L2.

2. Put cursor over L3; press 2nd, Stat/Resid. Press Enter.

3. Press enter again. A list of residuals should be in L3.

4. Turn all plots off under 2nd/y= to see stat plot. Also, check

y =. If you had a regression line plotted, its equation is

there, and if it confuses you to be there, clear it from y1.

5. Create new scatterplot. Xlist: L1 (explanatory variable)

and Ylist:L3 (residuals)

6. Zoom/9:ZoomStat

Examining Residual Plot

• Residual plot should show no obvious pattern. A curved

pattern shows that the relationship is not linear and a straight

line may not be the best model.

• Residuals should be relatively small in size. A regression line in a model that fits the data well should come “close” to most of the points.

• A commonly used measure of this is the standard deviation

of the residuals, given by:

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$$

For the NEA and fat gain data, $s = \sqrt{\dfrac{7.663}{14}} = 0.740$
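The same calculation can be sketched in code: square each residual, add them up, divide by n - 2, and take the square root. It assumes the rounded coefficients a = 3.505 and b = -0.00344, so the sum of squares may differ very slightly from a full-precision calculation.

```python
from math import sqrt

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # fat gain (kg)

A, B = 3.505, -0.00344    # rounded LSRL coefficients from the lesson

residuals = [y - (A + B * x) for x, y in zip(nea, fat)]
sum_sq_residuals = sum(res ** 2 for res in residuals)
n = len(nea)

s = sqrt(sum_sq_residuals / (n - 2))
print(f"sum of squared residuals = {sum_sq_residuals:.3f}")
print(f"s = {s:.3f} kg")
```

Roughly speaking, s is the typical size of a prediction error, in the units of the response variable (kg here).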

Facts about Least-Squares regression

• The distinction between explanatory and response variables is essential in regression. If we reverse the roles, we get a different least-squares regression line.

• There is a close connection between correlation and the slope of the LSRL. Slope is r times Sy/Sx. This says that a change of one standard deviation in x corresponds to a change of r standard deviations in y. When the variables are perfectly correlated (r = +/- 1), the change in the predicted response y hat is the same (in standard deviation units) as the change in x.

• The LSRL will always pass through the point (x-bar, y-bar).

• r squared is the fraction of variation in values of y explained by the least-squares regression of y on x.
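Two of these facts are easy to check numerically: reversing the roles of x and y gives a different line, and the LSRL passes through (x-bar, y-bar). The helper name `fit_line` in this sketch is made up for illustration.

```python
from statistics import mean

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # fat gain (kg)

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for predicting ys from xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

a_yx, b_yx = fit_line(nea, fat)   # fat gain predicted from NEA change
a_xy, b_xy = fit_line(fat, nea)   # roles reversed: NEA predicted from fat gain

# Drawn on the same axes (fat gain vs. NEA), the reversed line has slope 1/b_xy.
print(f"slope of y on x:               {b_yx:.5f}")
print(f"slope of x on y, on same axes: {1 / b_xy:.5f}")

# The y-on-x line evaluated at x-bar gives y-bar.
print(f"prediction at x-bar: {a_yx + b_yx * mean(nea):.4f}, y-bar: {mean(fat):.4f}")
```

The two slopes would match only if r were exactly +1 or -1; their product (b_yx times b_xy) equals r squared.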

Influential Observations

• Correlation r is not resistant. One unusual point in the scatterplot can greatly affect the value of r.

• The least-squares regression line is also not resistant; a point that is extreme in the x direction with no other points near it pulls the line toward itself. Such a point is influential.
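To see how strongly one point can matter, the sketch below refits the NEA data after adding a single extra subject far out in the x direction. The added point (1500 cal, 4.0 kg) is entirely hypothetical and was made up to illustrate influence; it is not part of the study data. The helper name `fit_and_correlate` is also made up.

```python
from math import sqrt
from statistics import mean

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]      # NEA change (cal)
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]      # fat gain (kg)

def fit_and_correlate(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / sqrt(sxx * syy)

a_old, b_old, r_old = fit_and_correlate(nea, fat)

# HYPOTHETICAL extra subject: extreme in x, with no other points nearby.
a_new, b_new, r_new = fit_and_correlate(nea + [1500], fat + [4.0])

print(f"original:   slope = {b_old:.5f}, r = {r_old:.3f}")
print(f"with point: slope = {b_new:.5f}, r = {r_new:.3f}")
```

With this made-up point the slope and r both shrink noticeably toward zero, even though 16 of the 17 points are unchanged.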
