Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Chapter 11
Correlation and Regression
Correlation
A relationship between two variables.
The data can be represented by ordered pairs (x, y)
x is the independent (or explanatory) variable
y is the dependent (or response) variable
A scatter plot can be used to determine whether a linear
(straight line) correlation exists between two variables.
Example:
x 1 2 3 4 5
y – 4 – 2 – 1 0 2
x
2 4
–2
– 4
y
2
6
Types of Correlation
x
y
As x increases, y tends to
decrease.
Negative Linear Correlation
As x increases, y tends to
increase.
x
y
Positive Linear Correlation
x
y
No Correlation
x
y
Nonlinear Correlation
Example: Constructing a Scatter Plot
An economist wants to determine whether there is a linear relationship between a country’s gross domestic
product (GDP)
and carbon dioxide (CO2) emissions. The data are shown in the table. Display the data in a scatter plot and
determine whether there appears to be a positive or negative linear correlation or no linear correlation. (Source:
World Bank and U.S. Energy Information Administration)
GDP
(trillions of
$), x
CO2
emission
(millions
of metric
tons), y
1.6 428.2
3.6 828.8
4.9 1214.2
1.1 444.6
0.9 264.0
2.9 415.3
2.7 571.8
2.3 454.9
1.6 358.7
1.5 573.5
Solution: Constructing a Scatter Plot
Appears to be a positive linear correlation. As the
gross domestic products increase, the carbon dioxide
emissions tend to increase.
Example: Constructing a Scatter Plot Using Technology
Old Faithful, located in Yellowstone National Park, is the world’s most
famous geyser. The duration (in minutes) of several of Old Faithful’s
eruptions and the times (in minutes) until the next eruption are shown
in the table. Using a TI-83/84, display the data in a scatter plot.
Determine the type of correlation.
Duration
x
Time,
y
Duration
x
Time,
y
1.80 56 3.78 79
1.82 58 3.83 85
1.90 62 3.88 80
1.93 56 4.10 89
1.98 57 4.27 90
2.05 57 4.30 89
2.13 60 4.43 89
2.30 57 4.47 86
2.37 61 4.53 89
2.82 73 4.55 86
3.13 76 4.60 92
3.27 77 4.63 91
3.65 77
Enter the x-values into list L1 and the y-values into list L2.
Use Stat Plot to construct the scatter plot.
STAT >
Edit…
STATPLOT
1 5 50
10
0 From the scatter plot, it appears that
the variables have a positive linear
correlation.
Correlation coefficient
A measure of the strength and the direction of a linear relationship between two variables.
The symbol r represents the sample correlation coefficient.
A formula for r is
The population correlation coefficient is represented by ρ (rho).
The range of the correlation coefficient is –1 to 1.
2 22 2
n xy x yr
n x x n y y
n is the number of data pairs
-1 0 1
If r = –1 there is a perfect
negative correlation
If r is close to 0 there is no
linear correlation
If r = 1 there is a perfect
positive correlation
Linear Correlation
Strong negative correlation Weak positive correlation Strong positive correlation Nonlinear Correlation
x
y
x
y
x
y
x
y
r = –0.91 r = 0.88 r = 0.42 r = 0.07
Using Technology to Find a Correlation Coefficient
To calculate r, you must fenter the
LinREgTTest command found in the Calc menu
STAT > Calc
r ≈ 0.979 suggests a strong positive correlation.
Use a technology tool to calculate the correlation coefficient for the Old Faithful data. What can you
conclude?
Using a Table to Test a Population Correlation Coefficient ρ
Once the sample correlation coefficient r has been calculated, we need to determine whether there is enough
evidence to decide that the population correlation coefficient ρ is significant at a specified level of significance.
Use correlation significance table
If |r| is greater than the critical value, there is enough evidence to decide that the correlation coefficient ρ is
significant.
Example:
Determine whether ρ is significant for five pairs of data (n = 5) at a level of
significance of α = 0.01.
If |r| > 0.959, the correlation is significant. Otherwise, there is not enough evidence to
conclude that the correlation is significant.
level of significance Number of pairs
of data in sample
Example: Using the Old Faithful
data, you used 25 pairs of data to
find
r ≈ 0.979. Is the correlation
coefficient significant? Use
α = 0.05.
Solution: n = 25, α = 0.05
|r| ≈ 0.979 > 0.396
There is enough evidence at the 5% level of significance to
conclude that there is a significant linear correlation
between the duration of Old Faithful’s eruptions and the
time between eruptions.
What the VALUE of r tells us:
The value of r is always between -1 and +1: -1≤ r ≤1.
The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close
to -1 or to +1 indicate a stronger linear relationship between x and y.
If r = 0 there is absolutely no linear relationship between x and y (no linear correlation).
If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation. In both these cases,
all of the original data points lie on a straight line. Of course, in the real world, this will not generally
happen.
What the SIGN of r tells us
A positive value of r means that when x increases, y tends to increase and when x decreases, y tends to
decrease (positive correlation).
A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends to
increase (negative correlation).
The sign of r is the same as the sign of the slope, b, of the best fit line
r2 is called the coefficient of determination. r2 is the square of the correlation coefficient, but is usually
stated as a percent, rather than in decimal form. r2 has an interpretation in the context of the data:
r2, when expressed as a percent, represents the percent of variation in the dependent variable y that can be
explained by variation in the independent variable x using the regression (best fit) line.
1 - r2, when expressed as a percent, represents the percent of variation in y that is NOT explained by
variation in x using the regression line. This can be seen as the scattering of the observed data points about
the regression line.
The Coefficient of Determination
Correlation and Causation
The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship
between the variables.
If there is a significant correlation between two variables, you should consider the following possibilities.
Is there a direct cause-and-effect relationship between the variables?
Does x cause y?
Is there a reverse cause-and-effect relationship between the variables?
Does y cause x?
Is it possible that the relationship between the variables can be caused by a third variable or by a combination of
several other variables?
Is it possible that the relationship between two variables may be a coincidence?
Linear Regression: Regression lines
After verifying that the linear correlation between two variables is significant, next we determine the equation of
the line that best models the data (regression line).
Can be used to predict the value of y for a given value of x.
x
y
Residual
The difference between the observed y-value and the predicted y-value for a given x-value on the line.
For a given x-value,
di = (observed y-value) – (predicted y-value)
x
y
}d1
}d2
d3
{
d4{ }d5
d6{
Predicted
y-value
Observe
d y-value
Regression line (line of best fit)
The line for which the sum of the squares of the residuals is a minimum.
The equation of a regression line for an independent variable x and a dependent variable y is
ŷ = mx + b
Predicted y-value
for a given x-value Slope
y-intercept
22
n xy x ym
n x x
y xb y mx m
n n
Example: Using Technology to Find a Regression Equation
Use a technology tool to find the equation of the regression line for the Old Faithful data.
Duration
x
Time,
y
Duration
x
Time,
y
1.8 56 3.78 79
1.82 58 3.83 85
1.9 62 3.88 80
1.93 56 4.1 89
1.98 57 4.27 90
2.05 57 4.3 89
2.13 60 4.43 89
2.3 57 4.47 86
2.37 61 4.53 89
2.82 73 4.55 86
3.13 76 4.6 92
3.27 77 4.63 91
3.65 77
ˆ 12.481 33.683y x
5 50
10
0
1
Example: Finding the Equation of a Regression Line
Find the equation of the regression line for the gross domestic products and carbon dioxide emissions data.
GDP
(trillions of
$), x
CO2
emission
(millions of
metric
tons), y
1.6 428.2
3.6 828.8
4.9 1214.2
1.1 444.6
0.9 264.0
2.9 415.3
2.7 571.8
2.3 454.9
1.6 358.7
1.5 573.5
To sketch the regression line, use any two x-values within the
range of the data and calculate the corresponding y-values
from the regression line.
ŷ = 196.152x + 102.289.
Use this equation to predict the expected carbon dioxide
emissions for the following gross domestic products. (Recall
from section 9.1 that x and y have a significant linear
correlation.)
1. 1.2 trillion dollars
2. 2.0 trillion dollars
3. 2.5 trillion dollars
ŷ =196.152(1.2) + 102.289 ≈ 337.671
ŷ =196.152(2.0) + 102.289 = 494.593
ŷ =196.152(2.5) + 102.289 = 592.669
Prediction values are meaningful only for x-values in (or close to) the range of the data. The x-values in the
original data set range from 0.9 to 4.9. So, it would not be appropriate to use the regression line to predict
carbon dioxide emissions for gross domestic products such as $0.2 or $14.5 trillion dollars.
Outliers
In some data sets, there are values (observed data points) called outliers. Outliers are observed data points
that are far from the least squares line. They have large "errors", where the "error" or residual is the vertical
distance from the line to the point.
Outliers need to be examined closely. Sometimes, for some reason or another, they should not be included in the
analysis of the data. It is possible that an outlier is a result of erroneous data.
Other times, an outlier may hold valuable information about the population under study and should remain
included in the data.
The key is to carefully examine what causes a data point to be an outlier.
Besides outliers, a sample may contain one or a few points that are called influential points. Influential points
are observed data points that are far from the other observed data points in the horizontal direction.
These points may have a big effect on the slope of the regression line. To begin to identify an influential point,
you can remove it from the data set and see if the slope of the regression line is changed significantly.
Identifying Outliers
We could guess at outliers by looking at a graph of the scatterplot and best fit line. However we would like
some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule
of thumb, we can flag any point that is located further than two standard deviations above or below the
best fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors.
We can do this visually in the scatterplot by drawing an extra pair of lines that are two standard deviations
above and below the best fit line. Any data points that are outside this extra pair of lines are flagged as
potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the
standard deviation.
A random sample of 11 statistics students produced the following data where x is the third exam score,
out of 80, and y is the final exam score, out of 200. Can you predict the final exam score of a random
student if you know the third exam score?
x (third
exam
score)
y (final exam
score)
65 175
67 133
71 185
71 163
66 126
75 198
67 153
70 163
71 159
69 151
69 159
The least squares regression line (best fit line) for the third
exam/final exam example has the equation:
yˆ = -173.51 + 4.83x
In this example, you can determine if there is an outlier or not. If there is an outlier, as an exercise, delete it and
fit the remaining data to a new line. For this example, the new line ought to fit the remaining data better. This
means the variation should be smaller and the correlation coefficient ought to be closer to 1 or -1.
Using the LinRegTTest with this data, scroll down through the output screens to find s = 16.412
Line Y2 = -173.5 + 4.83x - 2(16.4) and
line Y3 = -173.5 + 4.83x + 2(16.4)
where yˆ = -173.5 + 4.83x is the line of best fit.
Y2 and Y3 have the same slope as the line of best fit.
Graph the scatterplot with the best fit line in equation Y1, then enter the two extra lines as Y2 and
Y3 in the "Y="equation editor and press ZOOM 9. You will find that the only data point that is not
between lines Y2 and Y3 is the point x=65, y=175. On the calculator screen it is just barely outside
these lines. The outlier is the student who had a grade of 65 on the third exam and 175 on the final
exam; this point is further than 2 standard deviations away from the best fit line.
Numerical Identification of Outliers
In the table below, the first two columns are the third exam and final exam data. The third column shows the
predicted yˆ values calculated from the line of best fit: yˆ= -173.5 + 4.83x.
The residuals, or errors, have been calculated in the fourth column of the table:
observed y value−predicted y value=y−yˆ
s is the standard deviation of all the y−yˆ values where n = the total number of data points.
x y yˆ y−yˆ
65 175 140 175−140=35
67 133 150 133−150=-17
71 185 169 185−169=16
71 163 169 163−169=-6
66 126 145 126−145=-19
75 198 189 198−189=9
67 153 150 153−150=3
70 163 164 163−164=-1
71 159 169 159−169=-10
69 151 160 151−160=-9
69 159 160 159−160=-1
For this example, the calculator function LinRegTTest
found s = 16.4 as the standard deviation of the residuals
35; -17; 16;-6; -19; 9; 3; -1; -10; -9; -1.
We are looking for all data points for which the residual is
greater than 2s = 2(16.4) = 32.8 or less than -32.8.
Compare these values to the residuals in column 4 of the
table. The only such data point is the student who had a
grade of 65 on the third exam and 175 on the final exam;
the residual for this student is 35.
How does the outlier affect the best fit line?
Numerically and graphically, we have identified the point (65,175) as an outlier. We should re-examine the
data for this point to see if there are any problems with the data. If there is an error we should fix the error if
possible, or delete the data. If the data is correct, we would leave it in the data set.
For this problem, we will suppose that we examined the data and found that this outlier data was an
error. Therefore we will continue on and delete the outlier, so that we can explore how it affects the
results, as a learning experience.
Compute a new best-fit line and correlation coefficient using the 10 remaining points:
On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest, the new
line of best fit and the correlation coefficient are:
yˆ= -355.19 + 7.39x and r = 0.9121
The new line with r = 0.9121 is a stronger correlation than the original (r = 0.6631) because r = 0.9121 is
closer to 1. This means that the new line is a better fit to the 10 remaining data values. The line can better
predict the final exam score given the third exam score.
EXAMPLE: Using this new line of best fit (based on the remaining 10 data points), what would a student who
receives a 73 on the third exam expect to receive on the final exam? Is this the same as the prediction made using
the original line?
SOLUTION: Using the new line of best fit, yˆ=-355.19+7.39(73)= 184.28.
A student who scored 73 points on the third exam would expect to earn 184 points on the final exam.
The original line predicted yˆ= -173.51 + 4.83(73) = 179.08 so the prediction using the new line with the outlier
eliminated differs from the original prediction.