
Chapter 11

Correlation and Regression

Correlation

A relationship between two variables.

The data can be represented by ordered pairs (x, y)

x is the independent (or explanatory) variable

y is the dependent (or response) variable

A scatter plot can be used to determine whether a linear

(straight line) correlation exists between two variables.

Example:

x    1    2    3    4    5
y   –4   –2   –1    0    2

[Figure: scatter plot of the five (x, y) pairs]

Types of Correlation

Negative Linear Correlation: As x increases, y tends to decrease.

Positive Linear Correlation: As x increases, y tends to increase.

No Correlation

Nonlinear Correlation

[Figure: four scatter plots, one for each type of correlation]

Example: Constructing a Scatter Plot

An economist wants to determine whether there is a linear relationship between a country’s gross domestic product (GDP) and carbon dioxide (CO2) emissions. The data are shown in the table. Display the data in a scatter plot and determine whether there appears to be a positive or negative linear correlation or no linear correlation. (Source: World Bank and U.S. Energy Information Administration)

GDP (trillions of $), x    CO2 emissions (millions of metric tons), y

1.6    428.2
3.6    828.8
4.9    1214.2
1.1    444.6
0.9    264.0
2.9    415.3
2.7    571.8
2.3    454.9
1.6    358.7
1.5    573.5

Solution: Constructing a Scatter Plot

There appears to be a positive linear correlation. As gross domestic product increases, carbon dioxide emissions tend to increase.

Example: Constructing a Scatter Plot Using Technology

Old Faithful, located in Yellowstone National Park, is the world’s most

famous geyser. The duration (in minutes) of several of Old Faithful’s

eruptions and the times (in minutes) until the next eruption are shown

in the table. Using a TI-83/84, display the data in a scatter plot.

Determine the type of correlation.

Duration, x    Time, y    Duration, x    Time, y

1.80 56 3.78 79

1.82 58 3.83 85

1.90 62 3.88 80

1.93 56 4.10 89

1.98 57 4.27 90

2.05 57 4.30 89

2.13 60 4.43 89

2.30 57 4.47 86

2.37 61 4.53 89

2.82 73 4.55 86

3.13 76 4.60 92

3.27 77 4.63 91

3.65 77

Enter the x-values into list L1 and the y-values into list L2.

Use Stat Plot to construct the scatter plot.

STAT > Edit…

STATPLOT

[Figure: TI-83/84 scatter plot with x-axis from 1 to 5 and y-axis from 50 to 100]

From the scatter plot, it appears that the variables have a positive linear correlation.

Correlation coefficient

A measure of the strength and the direction of a linear relationship between two variables.

The symbol r represents the sample correlation coefficient.

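A formula for r is

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

where n is the number of data pairs.

The population correlation coefficient is represented by ρ (rho).

The range of the correlation coefficient is –1 to 1.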

If r = –1, there is a perfect negative correlation.

If r is close to 0, there is no linear correlation.

If r = 1, there is a perfect positive correlation.
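The formula for r can be checked by hand or with a few lines of code. The following is a minimal sketch, not part of the original notes: the function name `correlation_coefficient` is an illustrative choice, and the data are the GDP and CO2 values from the scatter plot example earlier in the chapter.

```python
from math import sqrt

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r, computed from the sum formula."""
    n = len(xs)  # number of data pairs
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# GDP (trillions of $) and CO2 emissions (millions of metric tons)
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

r = correlation_coefficient(gdp, co2)
print(f"r = {r:.3f}")
```

A value of r fairly close to 1 here agrees with the positive linear pattern seen in the scatter plot.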

Linear Correlation

[Figure: four scatter plots with their correlation coefficients]

Strong negative correlation: r = –0.91

Strong positive correlation: r = 0.88

Weak positive correlation: r = 0.42

Nonlinear correlation: r = 0.07

Using Technology to Find a Correlation Coefficient

Example: Use a technology tool to calculate the correlation coefficient for the Old Faithful data. What can you conclude?

To calculate r on a TI-83/84, enter the data into lists L1 and L2, then run the LinRegTTest command, found in the STAT > TESTS menu.

r ≈ 0.979 suggests a strong positive correlation.

Using a Table to Test a Population Correlation Coefficient ρ

Once the sample correlation coefficient r has been calculated, we need to determine whether there is enough

evidence to decide that the population correlation coefficient ρ is significant at a specified level of significance.

Use correlation significance table

If |r| is greater than the critical value, there is enough evidence to decide that the correlation coefficient ρ is

significant.

Example:

Determine whether ρ is significant for five pairs of data (n = 5) at a level of

significance of α = 0.01.

If |r| > 0.959, the correlation is significant. Otherwise, there is not enough evidence to

conclude that the correlation is significant.

[Table: critical values for the correlation coefficient, indexed by the number of pairs of data in the sample (n) and the level of significance (α)]

Example: Using the Old Faithful data, you used 25 pairs of data to find r ≈ 0.979. Is the correlation coefficient significant? Use α = 0.05.

Solution: n = 25, α = 0.05

|r| ≈ 0.979 > 0.396

There is enough evidence at the 5% level of significance to

conclude that there is a significant linear correlation

between the duration of Old Faithful’s eruptions and the

time between eruptions.
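The table test amounts to a lookup followed by a comparison. The sketch below is illustrative only: the dictionary `CRITICAL_VALUES` holds just the two critical values quoted in this section (a real table has many more rows and columns), and the function name `is_significant` is an assumption made for the example.

```python
# Critical values quoted in this section, keyed by (n, alpha).
# A complete correlation significance table contains many more entries.
CRITICAL_VALUES = {
    (5, 0.01): 0.959,
    (25, 0.05): 0.396,
}

def is_significant(r, n, alpha):
    """True when |r| exceeds the critical value for n pairs at level alpha."""
    critical = CRITICAL_VALUES[(n, alpha)]  # KeyError if the entry is not in this small table
    return abs(r) > critical

# Old Faithful: n = 25, alpha = 0.05, r ≈ 0.979
print(is_significant(0.979, 25, 0.05))
```

Because the comparison uses |r|, a strong negative correlation passes the test just as a strong positive one does.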

What the VALUE of r tells us:

The value of r is always between -1 and +1: -1 ≤ r ≤ 1.

The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close

to -1 or to +1 indicate a stronger linear relationship between x and y.

If r = 0 there is absolutely no linear relationship between x and y (no linear correlation).

If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation. In both these cases,

all of the original data points lie on a straight line. Of course, in the real world, this will not generally

happen.

What the SIGN of r tells us

A positive value of r means that when x increases, y tends to increase and when x decreases, y tends to

decrease (positive correlation).

A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends to

increase (negative correlation).

The sign of r is the same as the sign of the slope of the best fit line.

The Coefficient of Determination

r² is called the coefficient of determination. r² is the square of the correlation coefficient, but is usually stated as a percent, rather than in decimal form. r² has an interpretation in the context of the data:

r², when expressed as a percent, represents the percent of variation in the dependent variable y that can be explained by variation in the independent variable x using the regression (best fit) line.

1 – r², when expressed as a percent, represents the percent of variation in y that is NOT explained by variation in x using the regression line. This can be seen as the scattering of the observed data points about the regression line.
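To see the "explained variation" reading of r² in numbers, the sketch below splits the variation in y for the GDP and CO2 data into an explained and an unexplained part. It is a minimal illustration under stated assumptions: the variable names are invented for the example, and the line is fitted with the standard least-squares formulas in deviation form.

```python
# GDP (trillions of $) and CO2 emissions (millions of metric tons)
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

n = len(gdp)
mean_x = sum(gdp) / n
mean_y = sum(co2) / n

# Least-squares slope and intercept (deviation form of the usual formulas).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gdp, co2))
sxx = sum((x - mean_x) ** 2 for x in gdp)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in y, and the part left over as scatter about the line.
total_variation = sum((y - mean_y) ** 2 for y in co2)
unexplained = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(gdp, co2))

r_squared = 1 - unexplained / total_variation
print(f"explained: {r_squared:.1%}, unexplained: {1 - r_squared:.1%}")
```

For these data roughly 78% of the variation in CO2 emissions is explained by the regression on GDP, which matches squaring r ≈ 0.882.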

Correlation and Causation

The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship

between the variables.

If there is a significant correlation between two variables, you should consider the following possibilities.

Is there a direct cause-and-effect relationship between the variables?

Does x cause y?

Is there a reverse cause-and-effect relationship between the variables?

Does y cause x?

Is it possible that the relationship between the variables can be caused by a third variable or by a combination of

several other variables?

Is it possible that the relationship between two variables may be a coincidence?

Linear Regression: Regression lines

After verifying that the linear correlation between two variables is significant, next we determine the equation of

the line that best models the data (regression line).

The regression line can be used to predict the value of y for a given value of x.

[Figure: scatter plot with a regression line drawn through the data]

Residual

The difference between the observed y-value and the predicted y-value for a given x-value on the line.

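For a given x-value,

d_i = (observed y-value) – (predicted y-value)

[Figure: scatter plot with a fitted line; the residuals d1 through d6 are the vertical distances between each observed y-value and the predicted y-value on the line]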

Regression line (line of best fit)

The line for which the sum of the squares of the residuals is a minimum.

The equation of a regression line for an independent variable x and a dependent variable y is

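ŷ = mx + b

where ŷ is the predicted y-value for a given x-value, m is the slope, and b is the y-intercept:

$$m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$$

$$b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n}$$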
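The slope and intercept formulas translate directly into code. The sketch below applies them to the Old Faithful data from earlier in the chapter; the function name `regression_line` is an illustrative choice rather than anything defined in these notes.

```python
def regression_line(xs, ys):
    """Slope m and y-intercept b of the least-squares line, from the sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    return m, b

# Old Faithful: eruption duration (minutes) and time until the next eruption (minutes)
duration = [1.80, 1.82, 1.90, 1.93, 1.98, 2.05, 2.13, 2.30, 2.37, 2.82,
            3.13, 3.27, 3.65, 3.78, 3.83, 3.88, 4.10, 4.27, 4.30, 4.43,
            4.47, 4.53, 4.55, 4.60, 4.63]
wait = [56, 58, 62, 56, 57, 57, 60, 57, 61, 73,
        76, 77, 77, 79, 85, 80, 89, 90, 89, 89,
        86, 89, 86, 92, 91]

m, b = regression_line(duration, wait)
print(f"y-hat = {m:.3f}x + {b:.3f}")
```

The result should agree, up to rounding, with what a TI-83/84 reports for the same lists.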

Example: Using Technology to Find a Regression Equation

Use a technology tool to find the equation of the regression line for the Old Faithful data.

Duration, x    Time, y    Duration, x    Time, y

1.8 56 3.78 79

1.82 58 3.83 85

1.9 62 3.88 80

1.93 56 4.1 89

1.98 57 4.27 90

2.05 57 4.3 89

2.13 60 4.43 89

2.3 57 4.47 86

2.37 61 4.53 89

2.82 73 4.55 86

3.13 76 4.6 92

3.27 77 4.63 91

3.65 77

ŷ = 12.481x + 33.683

[Figure: TI-83/84 scatter plot with the regression line; x-axis from 1 to 5, y-axis from 50 to 100]

Example: Finding the Equation of a Regression Line

Find the equation of the regression line for the gross domestic products and carbon dioxide emissions data.

GDP (trillions of $), x    CO2 emissions (millions of metric tons), y

1.6 428.2

3.6 828.8

4.9 1214.2

1.1 444.6

0.9 264.0

2.9 415.3

2.7 571.8

2.3 454.9

1.6 358.7

1.5 573.5

The equation of the regression line is

ŷ = 196.152x + 102.289.

To sketch the regression line, use any two x-values within the range of the data and calculate the corresponding y-values from the regression line.

Use this equation to predict the expected carbon dioxide emissions for the following gross domestic products. (Recall from section 9.1 that x and y have a significant linear correlation.)

1. 1.2 trillion dollars

ŷ = 196.152(1.2) + 102.289 ≈ 337.671 million metric tons

2. 2.0 trillion dollars

ŷ = 196.152(2.0) + 102.289 = 494.593 million metric tons

3. 2.5 trillion dollars

ŷ = 196.152(2.5) + 102.289 = 592.669 million metric tons

Prediction values are meaningful only for x-values in (or close to) the range of the data. The x-values in the

original data set range from 0.9 to 4.9. So, it would not be appropriate to use the regression line to predict

carbon dioxide emissions for gross domestic products such as $0.2 trillion or $14.5 trillion.
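One way to respect this restriction in practice is to refuse predictions outside the observed x-range. The sketch below is illustrative only: the function name `predict_co2` is invented for the example, and the strict cutoff at 0.9 and 4.9 is an assumption (the text allows values "close to" the range, which a hard bound does not capture).

```python
SLOPE = 196.152      # from the regression equation in the text
INTERCEPT = 102.289
X_MIN, X_MAX = 0.9, 4.9   # range of GDP values in the original data

def predict_co2(gdp_trillions):
    """Predicted CO2 emissions (millions of metric tons) for a GDP inside the data range."""
    if not X_MIN <= gdp_trillions <= X_MAX:
        # Assumption: treat anything outside the observed range as extrapolation.
        raise ValueError(f"GDP {gdp_trillions} is outside the data range {X_MIN} to {X_MAX}")
    return SLOPE * gdp_trillions + INTERCEPT

for gdp_value in (1.2, 2.0, 2.5):
    print(gdp_value, round(predict_co2(gdp_value), 3))
```

Calling `predict_co2(14.5)` raises a `ValueError` instead of returning a number that the data cannot support.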

Outliers

In some data sets, there are values (observed data points) called outliers. Outliers are observed data points

that are far from the least squares line. They have large "errors", where the "error" or residual is the vertical

distance from the line to the point.

Outliers need to be examined closely. Sometimes, for some reason or another, they should not be included in the

analysis of the data. It is possible that an outlier is a result of erroneous data.

Other times, an outlier may hold valuable information about the population under study and should remain

included in the data.

The key is to carefully examine what causes a data point to be an outlier.

Besides outliers, a sample may contain one or a few points that are called influential points. Influential points

are observed data points that are far from the other observed data points in the horizontal direction.

These points may have a big effect on the slope of the regression line. To begin to identify an influential point,

you can remove it from the data set and see if the slope of the regression line is changed significantly.

Identifying Outliers

We could guess at outliers by looking at a graph of the scatterplot and best fit line. However, we would like

some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule

of thumb, we can flag any point that is located further than two standard deviations above or below the

best fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors.

We can do this visually in the scatterplot by drawing an extra pair of lines that are two standard deviations

above and below the best fit line. Any data points that are outside this extra pair of lines are flagged as

potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the

standard deviation.

A random sample of 11 statistics students produced the following data where x is the third exam score,

out of 80, and y is the final exam score, out of 200. Can you predict the final exam score of a random

student if you know the third exam score?

x (third exam score)    y (final exam score)

65 175

67 133

71 185

71 163

66 126

75 198

67 153

70 163

71 159

69 151

69 159

The least squares regression line (best fit line) for the third

exam/final exam example has the equation:

ŷ = -173.51 + 4.83x

In this example, you can determine if there is an outlier or not. If there is an outlier, as an exercise, delete it and

fit the remaining data to a new line. For this example, the new line ought to fit the remaining data better. This

means the variation should be smaller and the correlation coefficient ought to be closer to 1 or -1.

Using the LinRegTTest with this data, scroll down through the output screens to find s = 16.412. Use s to define two lines parallel to the line of best fit:

Line Y2 = -173.5 + 4.83x - 2(16.4) and

line Y3 = -173.5 + 4.83x + 2(16.4)

where ŷ = -173.5 + 4.83x is the line of best fit.

Y2 and Y3 have the same slope as the line of best fit.

Graph the scatterplot with the best fit line in equation Y1, then enter the two extra lines as Y2 and

Y3 in the "Y=" equation editor and press ZOOM 9. You will find that the only data point that is not

between lines Y2 and Y3 is the point x=65, y=175. On the calculator screen it is just barely outside

these lines. The outlier is the student who had a grade of 65 on the third exam and 175 on the final

exam; this point is further than 2 standard deviations away from the best fit line.

Numerical Identification of Outliers

In the table below, the first two columns are the third exam and final exam data. The third column shows the

predicted ŷ values calculated from the line of best fit: ŷ = -173.5 + 4.83x.

The residuals, or errors, have been calculated in the fourth column of the table:

observed y value − predicted y value = y − ŷ

s is the standard deviation of all the y − ŷ values, s = √(SSE / (n − 2)), where SSE is the sum of the squared residuals and n = the total number of data points.

x y ŷ y − ŷ

65 175 140 175−140=35

67 133 150 133−150=-17

71 185 169 185−169=16

71 163 169 163−169=-6

66 126 145 126−145=-19

75 198 189 198−189=9

67 153 150 153−150=3

70 163 164 163−164=-1

71 159 169 159−169=-10

69 151 160 151−160=-9

69 159 160 159−160=-1

For this example, the calculator function LinRegTTest

found s = 16.4 as the standard deviation of the residuals

35; -17; 16; -6; -19; 9; 3; -1; -10; -9; -1.

We are looking for all data points for which the residual is

greater than 2s = 2(16.4) = 32.8 or less than -32.8.

Compare these values to the residuals in column 4 of the

table. The only such data point is the student who had a

grade of 65 on the third exam and 175 on the final exam;

the residual for this student is 35.
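The numerical check can be reproduced without a calculator. The sketch below fits the line to the exam data, computes each residual, and flags any point whose residual is more than 2s from zero. It is a minimal illustration: the names are invented for the example, and s is computed as the square root of SSE / (n − 2), the standard error of the residuals for a fitted line. Because the code uses unrounded coefficients, its residuals differ slightly from the rounded values in the table.

```python
from math import sqrt

third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def fit_line(xs, ys):
    """Least-squares slope and intercept from the sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept

slope, intercept = fit_line(third_exam, final_exam)

# Residual = observed y minus predicted y, for every data point.
residuals = [y - (slope * x + intercept) for x, y in zip(third_exam, final_exam)]

# Standard deviation of the residuals (n - 2 in the denominator).
n = len(residuals)
s = sqrt(sum(e * e for e in residuals) / (n - 2))

# Flag points farther than 2s above or below the best fit line.
outliers = [(x, y) for x, y, e in zip(third_exam, final_exam, residuals)
            if abs(e) > 2 * s]

print(f"s = {s:.3f}")
print("potential outliers:", outliers)
```

Only the point (65, 175) should be flagged, in agreement with the graphical check.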

How does the outlier affect the best fit line?

Numerically and graphically, we have identified the point (65,175) as an outlier. We should re-examine the

data for this point to see if there are any problems with the data. If there is an error we should fix the error if

possible, or delete the data. If the data is correct, we would leave it in the data set.

For this problem, we will suppose that we examined the data and found that this outlier data was an

error. Therefore we will continue on and delete the outlier, so that we can explore how it affects the

results, as a learning experience.

Compute a new best-fit line and correlation coefficient using the 10 remaining points:

On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest, the new

line of best fit and the correlation coefficient are:

ŷ = -355.19 + 7.39x and r = 0.9121

The new line with r = 0.9121 is a stronger correlation than the original (r = 0.6631) because r = 0.9121 is

closer to 1. This means that the new line is a better fit to the 10 remaining data values. The line can better

predict the final exam score given the third exam score.

EXAMPLE: Using this new line of best fit (based on the remaining 10 data points), what would a student who

receives a 73 on the third exam expect to receive on the final exam? Is this the same as the prediction made using

the original line?

SOLUTION: Using the new line of best fit, ŷ = -355.19 + 7.39(73) = 184.28.

A student who scored 73 points on the third exam would expect to earn 184 points on the final exam.

The original line predicted ŷ = -173.51 + 4.83(73) = 179.08, so the prediction using the new line with the outlier eliminated differs from the original prediction.
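The effect of deleting the outlier can also be checked in code by fitting the line twice, once with all 11 points and once without (65, 175). This is a sketch under stated assumptions: the helper name `fit_and_correlate` is invented for the example, and because it keeps unrounded coefficients, its predictions for x = 73 (about 184.1 and 178.9) differ slightly from the 184.28 and 179.08 obtained above with rounded coefficients.

```python
from math import sqrt

def fit_and_correlate(points):
    """Return (slope, intercept, r) for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    sxy = n * sum_xy - sum_x * sum_y
    sxx = n * sum_x2 - sum_x ** 2
    syy = n * sum_y2 - sum_y ** 2
    slope = sxy / sxx
    intercept = sum_y / n - slope * sum_x / n
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

all_points = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
              (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]
kept_points = [p for p in all_points if p != (65, 175)]  # outlier removed

slope_old, intercept_old, r_old = fit_and_correlate(all_points)
slope_new, intercept_new, r_new = fit_and_correlate(kept_points)

# Predicted final exam score for a third exam score of 73 under each line.
pred_old = slope_old * 73 + intercept_old
pred_new = slope_new * 73 + intercept_new

print(f"with outlier:    y-hat = {intercept_old:.2f} + {slope_old:.2f}x, r = {r_old:.4f}")
print(f"without outlier: y-hat = {intercept_new:.2f} + {slope_new:.2f}x, r = {r_new:.4f}")
print(f"predictions at x = 73: {pred_old:.1f} (old), {pred_new:.1f} (new)")
```

The jump in r after removing a single point shows how strongly one outlier can influence both the fitted line and the measured strength of the correlation.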
