Lecture 11 Chapter 6. Correlation and Linear Regression

6.1 Introduction

This chapter is concerned with relationships between continuous variables.

Example (see Handout 11)

During the 1950s, radioactive water leaked into the Columbia River in Washington State. Data were collected on an exposure index (X) and the cancer mortality rate (Y) (deaths per 100,000 per year) for the years 1959-1964, for each of nine counties downstream:

Exposure (x):   8.3   6.4   3.4   3.8   2.6   11.6   1.2   2.5   1.6

Mortality (y):  210   180   130   170   130   210    120   150   140

Both the variables X and Y are measurements on a continuous scale.

We are interested in how these two variables are related, or associated.

As usual, the sensible thing to do first is to have a look at the data. The best thing to do here is to plot the mortality rate against the exposure index.

[Figure: scatter plot of Mortality (vertical axis, 120 to 210) against Exposure (horizontal axis, 0 to 10)]

The plot suggests that there is a clear relationship (association) between the mortality rate and the exposure index. The relationship looks approximately linear (like a straight line).

In this chapter we do two things:

1. Use a measure called correlation to describe the strength of the association between two variables.

2. Use a method called linear regression to model the relationship between two variables which are associated in a way which is approximately linear.

6.2 Correlation

There are several different measures of association in use, but we will only consider the most common, which is called Pearson’s product moment correlation coefficient, or more briefly the sample linear correlation coefficient, or just the Pearson correlation. It is usually denoted by the letter r.

Additional Notes (Slide 1 of 2)

• The value of r always lies between -1 and +1;

• Values of r near to +1 indicate a strong positive linear relationship;

• Values of r near to -1 indicate a strong negative linear relationship;

• Values of r near to 0 indicate there is very little linear relationship.

Additional Notes (Slide 2 of 2)

• Let’s see what Minitab tells us about the Pearson correlation for our example above. We use:

Stat>Basic Statistics>Correlation...

Minitab tells us two things:

• the Pearson correlation is r = 0.917

• the P-value is 0.000
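The value Minitab reports can be checked by hand. The sketch below assumes the usual definition of the Pearson coefficient, r = Sxy / sqrt(Sxx * Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means. The function name `pearson_r` is only illustrative; the data are the nine county values from the example.

```python
import math

# Exposure index (x) and cancer mortality rate (y) for the nine counties.
exposure = [8.3, 6.4, 3.4, 3.8, 2.6, 11.6, 1.2, 2.5, 1.6]
mortality = [210, 180, 130, 170, 130, 210, 120, 150, 140]


def pearson_r(x, y):
    """Pearson correlation: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and products of deviations from the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(exposure, mortality)
print(f"r = {r:.3f}")
```

Rounded to three decimal places, this agrees with the value of 0.917 given by Minitab.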

Note that this correlation is close to +1, indicating a strong positive linear relationship.

What about the p-value?

This is the result of the hypothesis test of the null hypothesis:

H0: The linear correlation in the population is zero.

Minitab displays p-values to three decimal places, so our value of p = 0.000 means p < 0.0005. A p-value this small indicates that we reject the null hypothesis. There does appear to be a strong positive linear relationship between exposure and mortality.
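One common way to carry out this test uses the statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), referred to a Student t distribution with n - 2 degrees of freedom. The slides do not say which test Minitab uses, so treating it as this one is an assumption. The sketch below computes t from the rounded value of r and approximates the two-sided p-value by numerically integrating the t density. The helper name `t_upper_tail` and the integration settings are illustrative choices.

```python
import math

r = 0.917   # Pearson correlation reported by Minitab (rounded)
n = 9       # number of counties

# Test statistic for H0: the population correlation is zero.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2


def t_upper_tail(t_value, df, upper=200.0, steps=20_000):
    """Approximate P(T > t_value) by Simpson's rule on the Student t density.

    The tail beyond `upper` is ignored; for these values it is negligible.
    `steps` must be even for Simpson's rule.
    """
    const = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )

    def pdf(u):
        return const * (1 + u * u / df) ** (-(df + 1) / 2)

    h = (upper - t_value) / steps
    total = pdf(t_value) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(t_value + i * h)
    return total * h / 3


p_value = 2 * t_upper_tail(abs(t), df)  # two-sided p-value
print(f"t = {t:.2f} on {df} df, two-sided p = {p_value:.5f}")
```

The statistic is roughly 6.1 on 7 degrees of freedom and the two-sided p-value is below 0.001, which is consistent with the 0.000 displayed by Minitab.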

The correlation coefficient r is a very useful summary measure, but it is often misused. Some points to remember are as follows:

1. A high correlation does not necessarily imply a cause-and-effect relationship.

2. Although a value of r close to 1 does indicate a strong positive linear association, a linear relationship is not always the most appropriate. Always produce a plot of y against x.

3. A value close to zero indicates no linear relationship. That does not necessarily mean there is no relationship!

For the data plotted below, r = 0.020, and the p-value is 0.854. This correctly identifies that there is no linear relationship, but there clearly is a relationship!

[Figure: scatter plot of y (vertical axis, 0 to 100) against x (horizontal axis, 0 to 20)]
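A small made-up illustration of the same point: if y depends on x exactly through a symmetric curve such as y = (x - 10)^2, the Pearson correlation is zero even though y is completely determined by x. The values below are hypothetical and are not the data shown in the plot.

```python
import math

# Hypothetical data: y is an exact quadratic function of x.
x = list(range(0, 21))              # 0, 1, ..., 20
y = [(xi - 10) ** 2 for xi in x]    # U-shaped, ranging from 0 to 100

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Pearson correlation of the hypothetical data.
r = sxy / math.sqrt(sxx * syy)
print(f"r = {r:.6f}")
```

Here r is zero up to floating-point rounding error, because the positive and negative deviations of x cancel exactly around the centre of the curve.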

6.3 Simple Linear Regression

The correlation coefficient tells us about the strength of a linear relationship, but it doesn’t allow us to do things like make predictions about new data.

For this we need a model for the data. If we think there is an approximately linear relationship, we use the equation of a straight line, which relates X and Y:

Y = α + βX

Here the values of α (alpha) and β (beta) are the intercept and the slope of the straight line respectively. The slope, β, is usually of much more interest, because it tells us how Y changes with X.

Since we don’t expect the data to lie exactly on a straight line, we always add a random error component, ε (epsilon), so the equation becomes:

Y = α + βX + ε (Equation 1)

Equation 1 is the equation of a simple linear regression. In order to use it to model our data, we need to choose the values of α and β which work best.

E.g. for the exposure-mortality data, we might obtain...

[Figure: Regression Plot of Mortality (vertical axis) against Exposure (horizontal axis), with the fitted line superimposed]

Mortality = 118.449 + 9.03279 Exposure

S = 14.5763   R-Sq = 84.2 %   R-Sq(adj) = 81.9 %
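The summary figures printed with the plot can be reproduced from the data. The sketch below assumes the standard definitions, which the slides do not spell out: R-Sq is the proportion of the variation in Y explained by the line, R-Sq(adj) adjusts it for the number of fitted parameters, and S is the residual standard deviation sqrt(SSE / (n - 2)). The variable names are illustrative.

```python
import math

exposure = [8.3, 6.4, 3.4, 3.8, 2.6, 11.6, 1.2, 2.5, 1.6]
mortality = [210, 180, 130, 170, 130, 210, 120, 150, 140]

n = len(exposure)
x_bar = sum(exposure) / n
y_bar = sum(mortality) / n
sxx = sum((x - x_bar) ** 2 for x in exposure)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(exposure, mortality))
syy = sum((y - y_bar) ** 2 for y in mortality)   # total sum of squares

# Fitted line (least squares).
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual sum of squares about the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(exposure, mortality))

r_sq = 1 - sse / syy                             # proportion of variation explained
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - 2)    # adjusted for one predictor
s = math.sqrt(sse / (n - 2))                     # residual standard deviation

print(f"S = {s:.4f}  R-Sq = {100 * r_sq:.1f}%  R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```

Under these assumed definitions the results match the Minitab output above to the precision shown.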

Notice that in the plot above, α has been chosen as 118.4, and β as 9.03.

This indicates that in our model, the mortality rate increases by 9.03 for every unit increase in the exposure index, and the mortality rate when the exposure index is zero is 118.4.
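As a small illustration of how the fitted line is used for prediction, the sketch below plugs an exposure value into the fitted equation. The function name `predict_mortality` and the exposure values 5.0 and 6.0 are hypothetical choices made for illustration.

```python
# Coefficients of the fitted line reported by Minitab.
ALPHA = 118.449   # intercept: modelled mortality rate at zero exposure
BETA = 9.03279    # slope: change in mortality rate per unit of exposure


def predict_mortality(exposure):
    """Modelled mortality rate (deaths per 100,000 per year) for an exposure index."""
    return ALPHA + BETA * exposure


at_five = predict_mortality(5.0)
at_six = predict_mortality(6.0)

print(f"Predicted mortality at exposure 5.0: {at_five:.1f}")
print(f"Increase for one extra unit of exposure: {at_six - at_five:.2f}")
```

The model gives about 163.6 deaths per 100,000 per year at an exposure index of 5.0, and raising the exposure by one unit adds the slope, 9.03.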

But how were these values chosen?

The usual criterion, and the one used above, is to use the least squares estimates for α and β...

We obtain these in Minitab using:

Stat>Regression>Regression...

if we want the equation etc., and

Stat>Regression>Fitted Line Plot...

if we want the graph with the fitted line superimposed.
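Outside Minitab, the same least squares estimates can be computed directly. The sketch below assumes the standard closed-form solution for simple linear regression: the slope is Sxy / Sxx and the intercept is y_bar - slope * x_bar. These are the values that minimise the sum of squared vertical distances between the points and the line. The function name `least_squares_line` is illustrative.

```python
exposure = [8.3, 6.4, 3.4, 3.8, 2.6, 11.6, 1.2, 2.5, 1.6]
mortality = [210, 180, 130, 170, 130, 210, 120, 150, 140]


def least_squares_line(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


alpha_hat, beta_hat = least_squares_line(exposure, mortality)
print(f"Mortality = {alpha_hat:.3f} + {beta_hat:.5f} Exposure")
```

For the exposure-mortality data this reproduces the fitted equation shown on the regression plot, Mortality = 118.449 + 9.03279 Exposure.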