Unit 3: Examining Relationships
Lesson 3: Least-Squares Regression
2016-2017
Bivariate Relationships
So far, you’ve learned that when you explore/describe a bivariate (x,y) relationship, you must:
Determine the Explanatory and Response variables
Plot the data in a scatterplot
Note the Strength, Direction, and Form
Note the mean and standard deviation of x and the mean and standard deviation of y
Calculate and Interpret the Correlation, r
If it is determined that a scatterplot shows a linear relationship, then it only makes sense to want to model that linear relationship with a line.
Least-squares regression is a method for finding a line that summarizes the relationship between two variables, but only in a specific setting.
Finding the equation of that line allows us to make predictions.
Regression Line
• Unlike correlation, regression requires an explanatory
variable and response variable.
• FYI: least-squares regression line is often abbreviated
as LSRL.
Example
• Some people don’t gain weight even when they overeat.
One possible explanation is in “nonexercise activity” (NEA),
such as fidgeting: some people may increase NEA when fed
more.
• For 8 weeks, 16 healthy young adults were deliberately
overfed. Researchers measured fat gain (in kg) and the
change in energy use (in calories) from activity other than
deliberate exercise, such as in fidgeting, daily living, etc.
• Do people with larger increases in NEA tend to gain less fat?
NEA change (cal) -94 -57 -29 135 143 151 245 355 392 473 486 535 571 580 620 690
Fat gain (kg) 4.2 3.0 3.7 2.7 3.2 3.6 2.4 1.3 3.8 1.7 1.6 2.2 1.0 0.4 2.3 1.1
Let’s properly examine this relationship
Step 1: Determine the Explanatory and Response variables
Explanatory: Nonexercise activity (in calories)
Response: Fat gain (kg)
Step 2: Plot the data in a scatterplot
Use your calculator
Step 3: Note the Direction, Form, and strength
Negative, linear, fairly strong
Let’s properly examine this relationship
Step 4: Note the mean and standard deviation of x and the mean and standard deviation of y
Use your 1-Var Stats
Explanatory variable: NEA Change (cal)
Mean: 324.8 calories
Standard Deviation: 257.66 calories
Response Variable: Fat gain (kg)
Mean: 2.388 kg
Standard Deviation: 1.1389 kg
Let’s properly examine this relationship
Step 5: Calculate and Interpret the Correlation, r
Again, use your calculator:
r = -0.7786
Indicates a fairly strong, negative linear association between NEA change and fat gain.
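For anyone who wants to check Steps 4 and 5 without a calculator, here is a minimal Python sketch. It assumes the sample standard deviation (dividing by n - 1, which is what 1-Var Stats reports as Sx) and the usual z-score formula for r. The function and variable names are illustrative only; they are not from the lesson.

```python
from math import sqrt

# NEA change (cal) and fat gain (kg) for the 16 subjects in the example
nea = [-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation: divide by n - 1
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    # r = average product of z-scores, using n - 1 in the denominator
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

x_bar, s_x = mean(nea), sample_sd(nea)
y_bar, s_y = mean(fat), sample_sd(fat)
r = correlation(nea, fat)

print(f"x: mean = {x_bar:.1f} cal, sd = {s_x:.2f} cal")
print(f"y: mean = {y_bar:.3f} kg, sd = {s_y:.4f} kg")
print(f"r = {r:.4f}")
```

The printed values should agree with the calculator results above, up to rounding.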
Step 6: So now what?
• Between the scatterplot and the value of r, it can be
concluded that a fairly strong linear relationship exists
between nonexercise activity and fat gain.
• To define the linear relationship between an explanatory
and response variable, a line needs to be drawn and its
equation found.
• We all could look at the same data and draw very
different lines to model it. There are multiple ways to
draw a regression line. So which one is the best line?
• Step 6: Calculate and Interpret the Least Squares
Regression Line in context.
Equation of LSRL
In the equation “y-hat” = a + bx:
• “y hat” is the predicted value of the response variable y for a given value of the explanatory variable x
• b is the slope
• a is the y-intercept
The Least-Squares Regression line creates the best fit for a scatterplot, and is
justified geometrically (see pp. 145-149, my webpage for a video, or come ask
me if you’re curious).
• Understanding and using the line is more important than the
details of where the equation came from.
• What is important is to understand what the slope and y-intercept
mean in context.
• Remember from Algebra 1, that slope is rate of change:
change in y (response variable) to change in x (explanatory
variable).
• Further, in statistics, along the regression line the slope
represents a change of one standard deviation in x that
corresponds to a change of r standard deviations in y.
• Also remember from Algebra 1 that the y-intercept is the value
for which the x is zero; in stats that means the y-intercept is
the predicted response variable value when the explanatory
variable is zero.
The statistical meaning of
slope and y-intercept
Example: Step 6: Calculate the equation of the least-squares regression line
• Format: “Y-hat” = a + bx (according to book). Truthfully, either
a or b could stand for the y-intercept or the slope (so it could
be written as “y-hat” = ax + b), but no matter which, the slope
is ALWAYS the coefficient of x.
• Solution: You can find the values of a and b using your
calculator, under stat-calc.
• CAREFUL: 4:LinReg(ax+b) vs. 8:LinReg(a+bx). Again,
doesn’t matter which one you use, just make sure to
correctly identify the y-intercept and slope. I’ll be writing it
consistently with the book’s format.
“y-hat” = 3.505 – 0.00344x
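The calculator's LinReg(a+bx) output can be reproduced with a short sketch. This assumes the standard least-squares formulas: b is the sum of (x - x-bar)(y - y-bar) divided by the sum of (x - x-bar) squared, and a = y-bar - b(x-bar). The function name `least_squares_line` is hypothetical.

```python
# NEA change (cal) and fat gain (kg) from the example
nea = [-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using the least-squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope: always the coefficient of x
    a = y_bar - b * x_bar    # y-intercept
    return a, b

a, b = least_squares_line(nea, fat)
print(f"y-hat = {a:.3f} + ({b:.5f})x")
```

Whichever calculator menu item you use, the number multiplying x here (`b`) is the slope and the other number (`a`) is the y-intercept.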
What a and b mean in context
• In this example, the slope b = -0.00344 tells us
that, according to the linear model, for each
additional calorie of NEA, predicted fat gain
goes down by 0.00344 kg. The slope of the regression
line is the predicted RATE OF CHANGE in
the response y as the explanatory variable x
changes.
• The y-intercept a = 3.505 kg is the model’s
estimate of how much fat is gained when there
is no change in nonexercise activity (0 calories).
Graphing the line of
regression by hand
By hand: The line always goes through the point (x-bar, y-bar)
and through the y-intercept (0, a). Since all you need to
graph a line is two points, these are an easy two
points to plot to make your line.
Plotting the line of regression on your TI after
your scatterplot is made and equation of
regression line is found
Using the model to
predict
• If a person’s NEA increases by 400 calories when
she overeats, how much fat gain is predicted?
Solution: use the model for predicting by substituting
the desired value into the equation. If x = 400, then
“y-hat” equals 2.13 kg.
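The substitution can be written as a tiny function. This sketch uses the rounded coefficients from the lesson (a = 3.505, b = -0.00344); the function name `predict_fat_gain` is made up for illustration.

```python
A = 3.505      # y-intercept (kg), rounded as in the lesson
B = -0.00344   # slope (kg per calorie), rounded as in the lesson

def predict_fat_gain(nea_change):
    """Predicted fat gain (kg) for a given NEA change (calories)."""
    return A + B * nea_change

predicted = predict_fat_gain(400)
print(round(predicted, 2))  # 2.13
```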
Example 2: Rats
Some data were collected on the weight of a male white laboratory rat for
the first 25 weeks after its birth. A scatterplot of the weight (in grams) and
time since birth (in weeks) shows a fairly strong, positive linear relationship.
The linear regression equation “weight hat” = 100 + 40(time) models the
data fairly well.
1. What is the slope of the regression line? Explain what it means in the
context. The slope is 40. The model predicts that the rat gains 40
grams per week.
2. What’s the y-intercept? Explain what it means in the context. The y-intercept
is 100. The model estimates the rat’s birth weight (time = 0) as 100 grams.
3. Predict the rat’s weight after 16 weeks. Show your work.
“weight hat” = 100 + 40(16) = 740 grams.
4. Should you use this line to predict the rat’s weight at age 2 years? Use
the equation to make the prediction and think about the reasonableness of
your work. Two years is about 104 weeks, so the predicted weight is
100 + 40(104) = 4260 grams, which is about 9.4 pounds. This is absurd.
The line should only be used for predictions within the range of the data
(the first 25 weeks); 2 years is far outside it.
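The rat calculations can be sketched in a few lines, along with a simple check for whether a prediction falls outside the observed data. The assumptions here are 52 weeks per year and about 453.592 grams per pound; the names `predict_weight` and `is_extrapolation` are hypothetical.

```python
MAX_OBSERVED_WEEK = 25    # data cover the first 25 weeks after birth
GRAMS_PER_POUND = 453.592

def predict_weight(weeks):
    """Predicted weight (grams) from weight-hat = 100 + 40(time)."""
    return 100 + 40 * weeks

def is_extrapolation(weeks):
    """True when the requested time lies outside the observed 0 to 25 weeks."""
    return weeks < 0 or weeks > MAX_OBSERVED_WEEK

week_16 = predict_weight(16)            # 740 grams, inside the data range
two_years = predict_weight(2 * 52)      # 4260 grams, far outside the data range
two_years_lb = two_years / GRAMS_PER_POUND

print(week_16, is_extrapolation(16))
print(two_years, round(two_years_lb, 1), is_extrapolation(2 * 52))
```

The equation will happily produce a number for any time you plug in; the range check is what tells you whether that number deserves any trust.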
In most cases, no line will pass through all the data points in a scatterplot.
But the line that best fits the data keeps the vertical distances of the actual
data points from the line as small as possible; the least-squares line minimizes
the sum of the squares of those distances.
There are many possible reasons that all the data values do not fall on a
straight line in a scatterplot; most of these reasons differ from data
point to data point and canNOT be individually identified or quantified.
However, we can quantify what percent of the variation in the response
variable, y, is accounted for by the regression line. It’s called the
coefficient of determination, or r-squared.
Assessing how well the data is fit by the line
R-squared: a way to assess the linear relationship
• The coefficient of determination, or r-squared, is
the fraction of the variation in the values of y that is
explained by the least-squares regression of y on
x.
• R-squared measures how successful the
regression was in explaining the response by
assigning the relationship a percentage; about
(r-squared x 100)% of the variation is accounted for by the
linear relationship.
• The derivation of the concept of r-squared is
somewhat time-consuming; just know what its
value tells us about a linear relationship.
• So in our example, the value of r was -0.7786.
• Our r-squared then would be (-0.7786)^2 = 0.6062
• This means that 60.6% of the variation (how it
changes) in y values is accounted for by the
least-squares line. The other 39.4% is not explained.
• BE SPECIFIC. What context is being referred to
with “y values”?
• Restated: 60.6% of the variation in fat gain is
accounted for by the least-squares line.
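To see where "fraction of variation explained" comes from, here is a hedged sketch. It computes r-squared as 1 - SSE/SST, where SSE is the sum of squared residuals from the least-squares line and SST is the sum of squared deviations of y from y-bar. The names SSE and SST are standard, but they are not used in this lesson.

```python
# NEA change (cal) and fat gain (kg) from the example
nea = [-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

n = len(nea)
x_bar = sum(nea) / n
y_bar = sum(fat) / n

# Least-squares slope and intercept
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(nea, fat)) / sum((x - x_bar) ** 2 for x in nea)
a = y_bar - b * x_bar

# Total variation in y, and the variation left over after using the line
sst = sum((y - y_bar) ** 2 for y in fat)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(nea, fat))

r_squared = 1 - sse / sst
print(f"r-squared = {r_squared:.4f}")
print(f"{100 * r_squared:.1f}% of the variation in fat gain is accounted for by the line")
```

The result should match squaring r = -0.7786 directly, up to rounding.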
R-squared Example
Residuals: another way to assess the relationship
We can informally assess how well a line fits a scatterplot by
examining the deviations of the plotted points from the line.
Specifically, we look at the vertical distances; the smaller the
vertical distances of the points above or below the line, the
stronger the linear relationship.
Residuals defined
• A residual is the difference between an observed value
of the response variable and the value predicted by the
regression line.
• Residual = observed y – predicted y
• Residual = y – y-hat
Finding Residuals Example
x: NEA change (cal) -94 -57 -29 135 143 151 245 355 392 473 486 535 571 580 620 690
y: Fat gain (kg) 4.2 3.0 3.7 2.7 3.2 3.6 2.4 1.3 3.8 1.7 1.6 2.2 1.0 0.4 2.3 1.1
Using this data, we found that the equation used to predict fat gain was
approximately y-hat = -0.0035x+3.512.
Use this defined relationship to find the prediction for each of these x-values.
Our gathered data was:
x: NEA change (cal) -94 -57 -29 135 143 151 245 355 392 473 486 535 571 580 620 690
y-hat: predicted fat gain (kg)
Obviously, the predicted fat gain was going to be different than the observed fat
gain. How different were the observed from the predicted?
y: Fat gain (kg) 4.2 3.0 3.7 2.7 3.2 3.6 2.4 1.3 3.8 1.7 1.6 2.2 1.0 0.4 2.3 1.1
y-hat: predicted fat gain (kg) 3.84 3.71 3.61 3.04 3.00 2.98 2.65 2.26 2.13 1.84 1.80 1.62 1.49 1.46 1.32 1.07
y - y-hat 0.36 -0.71 0.08 -0.33 0.19 0.62 -0.25 -0.96 1.67 -0.14 -0.20 0.58 -0.49 -1.42 0.98 0.03
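The same residual = y - y-hat calculation can be sketched in Python. This version fits the line from the data at full precision instead of using a rounded equation, so individual values may differ slightly from the table above. Variable names are illustrative.

```python
# NEA change (cal) and fat gain (kg) from the example
nea = [-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

n = len(nea)
x_bar = sum(nea) / n
y_bar = sum(fat) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(nea, fat)) / sum((x - x_bar) ** 2 for x in nea)
a = y_bar - b * x_bar

predicted = [a + b * x for x in nea]                    # y-hat for each subject
residuals = [y - y_hat for y, y_hat in zip(fat, predicted)]  # observed minus predicted

for x, y, y_hat, res in zip(nea, fat, predicted, residuals):
    print(f"x = {x:4d}  y = {y:.1f}  y-hat = {y_hat:.2f}  residual = {res:+.2f}")

print(f"sum of residuals = {sum(residuals):.10f}")
```

A positive residual means the point lies above the line (the model under-predicted); a negative residual means the point lies below it.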
More about residuals
• The mean of the least-squares residuals is always zero, though
roundoff error could make the computed sum not exactly zero.
• We can plot our residuals on a residual plot; the plot
makes patterns easier to see because the residual values on
the vertical axis are plotted against the explanatory
variable on the horizontal axis.
Residual Plot
• These vertical differences between
points and line, these residuals, are
considered errors in our predictions.
• The sum of the least-squares residuals is
always zero.
• The mean of the residuals is always zero,
so the horizontal line at zero in the figure
helps orient us. This “residual = 0” line
corresponds to the regression line.
• The closer the residuals are to the
horizontal line, the better the fit of the line
to the data
Making a residual plot on the calculators
1. X should be in L1, Y in L2.
2. Put cursor over L3; press 2nd, Stat/Resid. Press Enter.
3. Press enter again. A list of residuals should be in L3.
4. Turn all plots off under 2nd/y= to see stat plot. Also, check
y =. If you had a regression line plotted, its equation is
there, and if it confuses you to be there, clear it from y1.
5. Create new scatterplot. Xlist: L1 (explanatory variable)
and Ylist:L3 (residuals)
6. Zoom/9:ZoomStat
Examining Residual Plot
• Residual plot should show no obvious pattern. A curved
pattern shows that the relationship is not linear and a straight
line may not be the best model.
• Residuals should be relatively small in size. A regression line
in a model that fits the data well should come close to most
of the points.
• A commonly used measure of this is the standard deviation
of the residuals, given by:
$$ s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}} $$
For the NEA and fat gain data, $s = \sqrt{7.663 / 14} = 0.740$
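As a check on the arithmetic, this sketch computes the sum of squared residuals and the standard deviation of the residuals from the data, using the same n - 2 divisor as the formula. Variable names are illustrative.

```python
from math import sqrt

# NEA change (cal) and fat gain (kg) from the example
nea = [-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

n = len(nea)
x_bar = sum(nea) / n
y_bar = sum(fat) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(nea, fat)) / sum((x - x_bar) ** 2 for x in nea)
a = y_bar - b * x_bar

# Sum of squared residuals, then divide by n - 2 and take the square root
sum_sq_residuals = sum((y - (a + b * x)) ** 2 for x, y in zip(nea, fat))
s = sqrt(sum_sq_residuals / (n - 2))

print(f"sum of squared residuals = {sum_sq_residuals:.3f}")
print(f"s = {s:.3f} kg")
```

Because s is in the units of the response variable, it reads as the typical size of a prediction error: roughly 0.74 kg of fat gain here.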
Facts about Least-Squares regression
• The distinction between explanatory and response variables is essential
in regression. If we reverse the roles, we get a different least-squares
regression line.
• There is a close connection between correlation and the slope of the
LSRL. Slope is r times Sy/Sx. This says that a change of one standard
deviation in x corresponds to a change of r standard deviations in y.
When the variables are perfectly correlated (r = +/- 1), the change in the
predicted response y hat is the same (in standard deviation units) as the
change in x.
• The LSRL will always pass through the point (x bar, y bar)
• r squared is the fraction of variation in values of y explained by the x
variable
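These facts can be checked numerically on the NEA data. The sketch below is only an illustration under the usual least-squares formulas: it compares the fitted slope with r times Sy/Sx, confirms that the line passes through (x bar, y bar), and fits the reversed regression (x on y) to show that it is a different line. The helper name `fit` is hypothetical.

```python
from math import sqrt

nea = [-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

def fit(xs, ys):
    """Least-squares fit of ys on xs. Returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

n = len(nea)
x_bar, y_bar = sum(nea) / n, sum(fat) / n
s_x = sqrt(sum((x - x_bar) ** 2 for x in nea) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in fat) / (n - 1))
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(nea, fat)) / ((n - 1) * s_x * s_y)

a, b = fit(nea, fat)           # fat gain predicted from NEA change
a_rev, b_rev = fit(fat, nea)   # roles reversed: NEA change predicted from fat gain

slope_from_r = r * s_y / s_x        # should equal b
value_at_x_bar = a + b * x_bar      # should equal y_bar
reversed_as_y_on_x = 1 / b_rev      # slope of the reversed line, rewritten as y in terms of x

print(f"b = {b:.5f}, r * Sy/Sx = {slope_from_r:.5f}")
print(f"line at x-bar = {value_at_x_bar:.4f}, y-bar = {y_bar:.4f}")
print(f"reversed line rewritten in terms of x has slope {reversed_as_y_on_x:.5f}")
```

The two slopes in the last comparison would agree only if r were exactly +1 or -1, which is one way to see why choosing the explanatory variable matters.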
Influential Observations
• Correlation r is not resistant. One unusual point in the
scatterplot can greatly affect the value of r.
• The Least-Squares Regression Line is also not resistant; a point
extreme in the x direction with no other points near it pulls
the line toward itself. Such a point is influential.
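To get a feel for how much one point can matter, the sketch below refits the NEA data after adding a single made-up observation far out in the x direction. The extra point (1500 calories, 4.0 kg) is purely hypothetical and is not part of the study data; the helper name `fit` is also an assumption.

```python
from math import sqrt

nea = [-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

def fit(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / sqrt(sxx * syy)

# Original data
a, b, r = fit(nea, fat)

# HYPOTHETICAL extra subject: extreme in x, with no other points nearby
nea_plus = nea + [1500]
fat_plus = fat + [4.0]
a_new, b_new, r_new = fit(nea_plus, fat_plus)

print(f"original:        slope = {b:.5f}, r = {r:.4f}")
print(f"with extra point: slope = {b_new:.5f}, r = {r_new:.4f}")
```

With this one added point, both the slope and the correlation shrink sharply toward zero, which is what "not resistant" means in practice.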