Chapter 5
Regression
Objective: To quantify the linear relationship between an explanatory variable (x) and response variable (y).
We can then predict the average response for all subjects with a given value of the explanatory variable.
Linear Regression
Regression Line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
Example – Staying Thin
STATE: Some people don’t gain weight even when they overeat. Maybe differences in “nonexercise activity” account for this.
Researchers deliberately overfed 16 healthy young adults for eight weeks and measured fat gain (kg) and changes in energy use (cal) from nonexercise activities.
Example – Staying Thin
Question: Do people with large increases in nonexercise activity (NEA) gain less weight?
NEA change (cal)  -94   -57   -29   135   143   151   245   355
Fat gain (kg)     4.2   3.0   3.7   2.7   3.2   3.6   2.4   1.3
NEA change (cal)  392   473   486   535   571   580   620   690
Fat gain (kg)     3.8   1.7   1.6   2.2   1.0   0.4   2.3   1.1
Example – Staying Thin
PLAN: Make a scatterplot of the data and examine the pattern. If it’s linear, use correlation to measure the strength and direction. Draw a regression line on the plot to predict fat gain from change in NEA.
Example – Staying Thin
SOLVE: The scatterplot shows a moderately strong negative linear association, with r = -0.78. The line on the plot is the regression line that predicts weight gain from changes in NEA.
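As a check, the correlation for the 16 data pairs above can be computed directly; a minimal sketch in Python (variable names are my own):

```python
import math

# NEA change (cal) and fat gain (kg) for the 16 subjects
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

n = len(nea)
mean_x, mean_y = sum(nea) / n, sum(fat) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(nea, fat))
sxx = sum((x - mean_x) ** 2 for x in nea)
syy = sum((y - mean_y) ** 2 for y in fat)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))  # -0.78
```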
Least Squares
Used to determine the “best” line
We want the line to be as close as possible to the data points in the vertical (y) direction (since that is what we are trying to predict)
Least Squares: use the line that minimizes the sum of the squares of the vertical distances of the data points from the line
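The "minimize the sum of squared vertical distances" criterion can be made concrete with a small sketch (toy data of my own, not from the chapter): the least-squares line beats any nearby line on this criterion.

```python
def sse(a, b, xs, ys):
    # sum of squared vertical distances from each point to the line y = a + b*x
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [0, 1, 2, 3]
ys = [1, 3, 2, 4]

# least-squares line for these four points (from the usual formulas)
a_ls, b_ls = 1.3, 0.8
best = sse(a_ls, b_ls, xs, ys)
print(round(best, 2))  # 1.8

# any other line has a larger sum of squared errors
print(best < sse(1.3, 0.9, xs, ys))  # True
print(best < sse(1.4, 0.8, xs, ys))  # True
```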
Straight Lines
A straight line relating response variable y to explanatory variable x is given by:

y = a + bx

b is the slope, the amount by which y changes when x increases by one unit
a is the intercept, the value of y when x = 0
Least Squares Regression Line
Regression equation: ŷ = a + bx
– x is the value of the explanatory variable
– ŷ ("y-hat") is the average value of the response variable (the predicted response for a value of x)
– note that a and b are just the intercept and slope of a straight line
– note that r and b are not the same thing, but their signs will agree
Example – Using Regression Line
Any line describing the weight gain–NEA data has the form:

fat gain = a + b × (NEA change)

The line in the plot has specific values of a and b that produce this regression line:

fat gain = 3.505 - 0.00344 × (NEA change)
Example – Using Regression Line
The slope b = -0.00344 says that for each added calorie of NEA, fat gain goes down by 0.00344 kg. This is the rate of change of fat gain with changes in NEA.
The intercept a = 3.505 says that our estimate of the amount of weight gained for someone whose NEA doesn’t change at all (NEA change = 0) is 3.505 kg.
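With a and b in hand, prediction is just substitution into the line; for instance, for a hypothetical NEA increase of 400 calories (an illustrative value, not taken from the slides):

```python
def predict_fat_gain(nea_change):
    # regression line from the example: fat gain = 3.505 - 0.00344 * (NEA change)
    return 3.505 - 0.00344 * nea_change

print(round(predict_fat_gain(400), 2))  # 2.13 kg predicted fat gain
```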
Regression Line Calculation
Regression equation: ŷ = a + bx

slope: b = r × (sy / sx)

intercept: a = ȳ - b × x̄

where sx and sy are the standard deviations of the two variables, and r is their correlation
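These formulas translate directly into code; a sketch that recovers the a and b reported for the NEA example from the raw data (the helper name is my own):

```python
import math

def least_squares(xs, ys):
    # slope b = r * (sy / sx), intercept a = mean(y) - b * mean(x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    return my - b * mx, b   # (a, b)

nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

a, b = least_squares(nea, fat)
print(round(a, 3), round(b, 5))  # 3.505 -0.00344
```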
Regression Calculation Case Study
Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe
Country Per Capita GDP (x) Life Expectancy (y)
Austria 21.4 77.48
Belgium 23.2 77.53
Finland 20.0 77.32
France 22.7 78.63
Germany 20.8 77.17
Ireland 18.6 76.39
Italy 21.5 78.51
Netherlands 22.0 78.15
Switzerland 23.8 78.99
United Kingdom 21.2 77.37
Regression Calculation Case Study
x̄ = 21.52    ȳ = 77.754    r = 0.809
sx = 1.532   sy = 0.795
Linear regression equation:

b = r × (sy / sx) = 0.809 × (0.795 / 1.532) = 0.420

a = ȳ - b × x̄ = 77.754 - (0.420)(21.52) = 68.716

ŷ = 68.716 + 0.420x
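The same arithmetic can be checked against the table; a sketch computing b and a straight from the ten data pairs rather than from the rounded summary statistics (so the intercept differs from the slide value in the third decimal):

```python
gdp  = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(gdp)
mx, my = sum(gdp) / n, sum(life) / n

# slope as sum of cross-deviations over sum of squared x-deviations
b = sum((x - mx) * (y - my) for x, y in zip(gdp, life)) \
    / sum((x - mx) ** 2 for x in gdp)
a = my - b * mx

print(round(b, 3), round(a, 2))  # b ≈ 0.420, a ≈ 68.72 (slides round to 68.716)
```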
Coefficient of Determination (R²)

Measures the usefulness of the regression prediction
R² (or r², the square of the correlation) measures what fraction of the variation in the values of the response variable (y) is explained by the regression line
– r = 1: R² = 1: the regression line explains all (100%) of the variation in y
– r = 0.7: R² = 0.49: the regression line explains about half (49%) of the variation in y
Example – Using R2
For the NEA data (predicting weight gain with changes in NEA):
r = -0.7786
r² = (-0.7786)² = 0.6062

That is, about 61% of the variation in weight gained is accounted for by the linear relationship with changes in NEA. The other 39% or so is not explained by this relationship.
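In code, R² for this example is just the square of the correlation; a one-line check:

```python
r = -0.7786          # correlation between NEA change and fat gain
r2 = r ** 2          # fraction of variation in fat gain explained by the line
print(round(r2, 4))  # 0.6062
```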
Residuals
A residual is the difference between an observed value of the response variable and the value predicted by the regression line:
residual = y - ŷ
Residuals
A residual plot is a scatterplot of the regression residuals against the explanatory variable
– used to assess the fit of a regression line
– look for a “random” scatter around zero
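A sketch of the residual computation for the NEA data, using the rounded a and b from the slides (so the residuals sum only approximately to zero):

```python
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

a, b = 3.505, -0.00344  # regression line from the slides (rounded)
residuals = [y - (a + b * x) for x, y in zip(nea, fat)]

# with exact least-squares coefficients this sum would be exactly 0
print(abs(sum(residuals)) < 0.05)  # True
```

Plotting `residuals` against `nea` gives the residual plot described above.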
Example - Empathy
“Empathy” means understanding how others feel
To investigate empathy, researchers measured brain activity in women who watched their partner receive a harmless but painful electric shock, and compared that to their score on an ‘empathy test’
Question: do women who score high on empathy also respond more strongly to watching their partner being zapped?
Example 5.5 - Empathy
The next slide displays a scatterplot of the data collected by the researchers
There’s a positive association – as we suspected
The overall pattern is moderately linear, with r = 0.515
The line on the plot is the least squares line of brain activity vs. empathy score
Example 5.5 - Empathy
The least squares regression line for brain activity vs. empathy score is given by:

ŷ = -0.0578 + 0.00761x

For subject 1, whose empathy score = 38 and brain activity value = -0.120, we predict:

ŷ = -0.0578 + 0.00761(38) = 0.231
Example 5.5 - Empathy
So the residual for subject 1 is:

residual = observed - predicted = y - ŷ = -0.120 - 0.231 = -0.351

The residual is negative because the regression line is above the observed value
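The prediction and residual for subject 1 can be reproduced directly:

```python
a, b = -0.0578, 0.00761   # least squares line from the slides
x1, y1 = 38, -0.120       # subject 1: empathy score and observed brain activity

y_hat = a + b * x1        # predicted brain activity
residual = y1 - y_hat     # observed minus predicted
print(round(y_hat, 3), round(residual, 3))  # 0.231 -0.351
```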
Example 5.5 - Empathy
We can plot the residuals themselves against the explanatory variable to create a residual plot
The residual plot helps to show how well the regression line describes the data
The horizontal red “residual = 0” line corresponds to the regression line in the scatterplot
Outliers and Influential Points
An outlier is an observation that lies far away from the other observations
– outliers in the y direction have large residuals
– outliers in the x direction are often influential for the least-squares regression line, meaning that the removal of such points would markedly change the equation of the line
Example – Influential Observation
Subject 16 in the empathy data seems to be an outlier:
– Removing subject 16 from the calculation reduces r from 0.515 to 0.331 – so it is influential for the correlation
– Reporting r = 0.515 isn’t very useful because it depends so much on subject 16
Question: does subject 16 also have a big influence on the least squares regression line?
Example – Influential Observation
As you can see in the plot on the previous slide, removing the point for subject 16 from the calculation of the least squares regression line has almost no effect – so it is not influential where it currently sits
However, if we move it around, it does become influential
Example – Influential Observation
When we move the point for subject 16 straight down, it pulls the regression line with it
when an outlier does not lie close to the regression line determined by the rest of the points in the dataset, it is influential
in the regression setting, not all outliers are influential
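This can be demonstrated on toy data (made up here, not the empathy data): an extreme-x point that sits on the trend barely moves the least squares slope, but the same point pulled off the trend drags the line toward itself.

```python
def ls_slope(xs, ys):
    # least-squares slope: sum of cross-deviations over sum of squared x-deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)

xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]          # perfectly linear: slope 1

on_trend  = ls_slope(xs + [20], ys + [20])  # outlier in x, but on the line
off_trend = ls_slope(xs + [20], ys + [0])   # same x, pulled down off the line

print(round(on_trend, 2))   # 1.0
print(round(off_trend, 2))  # -0.13 -- one point reversed the slope's sign
```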
Cautions about Correlation and Regression

– correlation and regression only describe linear relationships
– both are affected by outliers
– always plot the data before interpreting
– beware of extrapolation: predicting outside of the range of x
– beware of lurking variables: variables that have an important effect on the relationship among the variables in a study but are not included in the study
– association does not imply causation
Evidence of Causation

A properly conducted experiment establishes the connection
Other considerations:
– The association is strong
– The association is consistent: the connection happens in repeated trials and under varying conditions
– Higher doses are associated with stronger responses
– The alleged cause precedes the effect in time
– The alleged cause is plausible (there is a reasonable explanation)