Descriptive measures of the strength of a linear association: r-squared and the (Pearson) correlation coefficient r


Page 1: Descriptive measures of the strength of a  linear association

Descriptive measures of the strength of a linear association

r-squared and the (Pearson) correlation coefficient r

Page 2: Descriptive measures of the strength of a  linear association

Translating a research question into a statistical procedure

• How strong is the linear relationship between skin cancer mortality and latitude?
– (Pearson) correlation coefficient r
– Coefficient of determination r²

Page 3: Descriptive measures of the strength of a  linear association

Where does this topic fit in?

• Model formulation

• Model estimation

• Model evaluation

• Model use

Page 4: Descriptive measures of the strength of a  linear association

Situation #1: A very weak linear relationship

[Regression plot: y versus x. Fitted line: y = 54.4758 - 0.764016 x; S = 7.81137, R-Sq = 6.5 %, R-Sq(adj) = 3.2 %]

$$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 1827.6$$

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 1708.5$$

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 119.1$$

Page 5: Descriptive measures of the strength of a  linear association

Situation #2: A fairly strong linear relationship

[Regression plot: y versus x. Fitted line: y = 75.5458 - 5.76402 x; S = 7.81137, R-Sq = 79.9 %, R-Sq(adj) = 79.2 %]

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 6679.3$$

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 1708.5$$

$$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 8487.8$$

Page 6: Descriptive measures of the strength of a  linear association

Coefficient of determination r²

$$r^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$

• r² is a number (a proportion!) between 0 and 1.
• If r² = 1:
– all data points fall perfectly on the regression line
– the predictor x accounts for all of the variation in y
• If r² = 0:
– the fitted regression line is perfectly horizontal
– the predictor x accounts for none of the variation in y
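As a minimal sketch of this definition, the Python below (standard library only) fits a least-squares line to a small made-up data set and computes r² both as SSR/SSTO and as 1 - SSE/SSTO. The data values and the helper name `sums_of_squares` are illustrative assumptions, not taken from the slides.

```python
# Hypothetical data, chosen only to illustrate SSTO = SSR + SSE.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sums_of_squares(x, y):
    """Return (SSTO, SSE, SSR) for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope b1 and intercept b0.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ssto = sum((yi - y_bar) ** 2 for yi in y)                # total variation in y
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # variation left after the fit
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)             # variation captured by the line
    return ssto, sse, ssr

ssto, sse, ssr = sums_of_squares(x, y)
r2_from_ssr = ssr / ssto
r2_from_sse = 1 - sse / ssto
print(r2_from_ssr, r2_from_sse)  # the two values agree up to rounding error
```

Both expressions give the same proportion because SSTO = SSR + SSE for a least-squares line with an intercept.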

Page 7: Descriptive measures of the strength of a  linear association

Interpretation of r²

• r² × 100 percent of the variation in y is reduced by taking into account predictor x.

• r² × 100 percent of the variation in y is “explained by” the variation in predictor x.

Page 8: Descriptive measures of the strength of a  linear association

R-sq in Minitab fitted line plot

[Fitted line plot: Mortality versus Latitude (at center of state). Fitted line: Mort = 389.189 - 5.97764 Lat; S = 19.1150, R-Sq = 68.0 %, R-Sq(adj) = 67.3 %]

Page 9: Descriptive measures of the strength of a  linear association

R-sq in Minitab regression output

The regression equation is
Mort = 389.189 - 5.97764 Lat

S = 19.1150   R-Sq = 68.0 %   R-Sq(adj) = 67.3 %

Analysis of Variance

Source       DF       SS       MS        F      P
Regression    1  36464.2  36464.2  99.7968  0.000
Error        47  17173.1    365.4
Total        48  53637.3
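The R-Sq value in this output can be recovered from the SS column of the Analysis of Variance table. The following worked arithmetic uses only the numbers printed above (Regression SS as SSR, Error SS as SSE, Total SS as SSTO):

```latex
r^2 = \frac{SSR}{SSTO} = \frac{36464.2}{53637.3} \approx 0.680,
\qquad
r^2 = 1 - \frac{SSE}{SSTO} = 1 - \frac{17173.1}{53637.3} \approx 0.680
```

Either route gives 68.0 %, matching the R-Sq reported by Minitab.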

Page 10: Descriptive measures of the strength of a  linear association

Pearson correlation coefficient r

If r² is represented in decimal form, e.g. 0.39 or 0.87, then:

$$r = \pm\sqrt{r^2}$$

• r is a (unitless) number between -1 and 1, inclusive.

• Sign of coefficient of correlation

– plus sign if slope of fitted regression line is positive

– negative sign if slope of fitted regression line is negative
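As a rough sketch of the rule on this slide, the hypothetical helper `r_from_r2` below takes r² in decimal form together with the estimated slope and returns r with the matching sign. The inputs in the usage line are the rounded values from the skin cancer mortality output (R-Sq = 68.0 %, slope -5.97764), so the result only approximates the correlation Minitab computes from the raw data.

```python
import math

def r_from_r2(r2, slope):
    """Return r = +/- sqrt(r2), taking its sign from the fitted slope.

    r_from_r2 is an illustrative helper name, not a Minitab or library function.
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must be a proportion between 0 and 1")
    magnitude = math.sqrt(r2)
    # copysign attaches the sign of the slope to the magnitude.
    return math.copysign(magnitude, slope)

r = r_from_r2(0.680, -5.97764)
print(round(r, 3))
```

Because the slope is negative, the negative root is chosen, which agrees to three decimals with the Pearson correlation of -0.825 reported for Mort and Lat.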

Page 11: Descriptive measures of the strength of a  linear association

Formulas for the Pearson correlation coefficient r

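$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

$$r = \frac{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \times b_1$$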

Page 12: Descriptive measures of the strength of a  linear association

What do we learn from the formulas for r?

• The correlation coefficient r gets its sign from the slope b1.

• The correlation coefficient r is a unitless measure.

• The correlation coefficient r = 0 when the estimated slope b1 = 0 and vice versa.
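These three observations can be checked numerically. The sketch below, using made-up data and a hypothetical helper `slope_and_r`, computes r from both formulas, confirms that r and b1 share a sign, and shows that rescaling x changes the slope but not r.

```python
import math

def slope_and_r(x, y):
    """Return (b1, r_direct, r_via_slope) for paired data x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                                       # estimated slope
    r_direct = sxy / math.sqrt(sxx * syy)                # first formula
    r_via_slope = math.sqrt(sxx) / math.sqrt(syy) * b1   # second formula
    return b1, r_direct, r_via_slope

# Hypothetical data with a negative trend (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.5, 8.1, 7.4, 5.2, 4.8, 2.9]

b1, r_direct, r_via_slope = slope_and_r(x, y)

# Changing the units of x rescales the slope but leaves r unchanged.
x_scaled = [100.0 * xi for xi in x]
b1_scaled, r_scaled, _ = slope_and_r(x_scaled, y)

print(b1, r_direct, r_via_slope)
print(b1_scaled, r_scaled)
```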

Page 13: Descriptive measures of the strength of a  linear association

Interpretation of Pearson correlation coefficient r

• There is no nice practical interpretation for r as there is for r².

• r = -1 is perfect negative linear relationship.
• r = 1 is perfect positive linear relationship.
• r = 0 is no linear relationship.
• For other r, how strong the relationship between x and y is deemed depends on the research area.

Page 14: Descriptive measures of the strength of a  linear association

Pearson correlation coefficient r in Minitab

Correlations: Mort, Lat

Pearson correlation of Mort and Lat = -0.825

Correlations: Lat, Mort

Pearson correlation of Lat and Mort = -0.825

Page 15: Descriptive measures of the strength of a  linear association

How strong is the linear relationship between Celsius and Fahrenheit?

[Regression plot: Fahrenheit versus Celsius. Fitted line: Fahrenheit = 32 + 1.8 Celsius; S = 0, R-Sq = 100.0 %, R-Sq(adj) = 100.0 %]

Pearson correlation of Celsius and Fahrenheit = 1.000
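A short sketch of why this example gives r = 1: Fahrenheit is an exact increasing linear function of Celsius, so every point falls on the line. The temperature values and the helper name `pearson_r` below are illustrative assumptions; the slides do not list the plotted data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of paired data x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

celsius = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]   # assumed values
fahrenheit = [32 + 1.8 * c for c in celsius]           # exact linear relationship

r_cf = pearson_r(celsius, fahrenheit)
r_fc = pearson_r(fahrenheit, celsius)   # swapping the variables does not change r
print(round(r_cf, 3), round(r_fc, 3))
```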

Page 16: Descriptive measures of the strength of a  linear association

How strong is the linear relationship between # of stories and height?

[Regression plot: HEIGHT versus STORIES. Fitted line: HEIGHT = 90.3096 + 11.2924 STORIES; S = 58.3259, R-Sq = 90.4 %, R-Sq(adj) = 90.2 %]

Pearson correlation of HEIGHT and STORIES = 0.951

Page 17: Descriptive measures of the strength of a  linear association

How strong is the linear relationship between driver age and see distance?

[Regression plot: Distance versus DrivAge. Fitted line: Distance = 576.682 - 3.00684 DrivAge; S = 49.7616, R-Sq = 64.2 %, R-Sq(adj) = 62.9 %]

Pearson correlation of Distance and DrivAge = -0.801

Page 18: Descriptive measures of the strength of a  linear association

How strong is the linear relationship between height and g.p.a.?

[Regression plot: gpa versus height. Fitted line: gpa = 3.41021 - 0.0065630 height; S = 0.542316, R-Sq = 0.3 %, R-Sq(adj) = 0.0 %]

Pearson correlation of height and gpa = -0.053

Page 19: Descriptive measures of the strength of a  linear association

Caution #1

• The correlation coefficient r quantifies the strength of a linear relationship.

• It is possible to get r = 0 with a perfect curvilinear relationship.

Page 20: Descriptive measures of the strength of a  linear association

Example of Caution #1

[Regression plot: y versus x. Fitted line: y = 14 - 0.0000000 x; S = 13.4907, R-Sq = 0.0 %, R-Sq(adj) = 0.0 %]

Pearson correlation of x and y = 0.000

Page 21: Descriptive measures of the strength of a  linear association

Clarification of Caution #1

[Regression plot: y versus x with a quadratic fit. Fitted curve: y = 0.0000000 - 0.0000000 x + 1 x**2; S = 0, R-Sq = 100.0 %, R-Sq(adj) = 100.0 %]

Pearson correlation of x and y = 0.000
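Caution #1 can be reproduced with a few lines of Python. The sketch assumes the plotted data are y = x² at the integers x = -6, ..., 6. The slides do not list the data, so this choice is only a guess; it happens to give the same fitted line (y = 14) and the same S = 13.4907 as the linear fit shown.

```python
import math

x = list(range(-6, 7))      # assumed x values, symmetric about 0
y = [xi ** 2 for xi in x]   # perfect curvilinear (quadratic) relationship

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
b1 = sxy / sxx                   # slope of the fitted straight line
b0 = y_bar - b1 * x_bar          # intercept of the fitted straight line

# Residual standard deviation S of the straight-line fit.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

print(r, b0, b1, round(s, 4))
```

The positive and negative halves of the parabola cancel in the cross-product sum, so r is 0 even though y is completely determined by x.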

Page 22: Descriptive measures of the strength of a  linear association

Caution #2

• A large r² value should not be interpreted as meaning that the estimated regression line fits the data well.

• Another function might better describe the trend in the data.

Page 23: Descriptive measures of the strength of a  linear association

Example of Caution #2

[Regression plot: US Population (millions) versus Year. Fitted line: USPopn = -2217.46 + 1.21862 Year; S = 22.8349, R-Sq = 92.0 %, R-Sq(adj) = 91.6 %]

Pearson correlation of Year and USPopn = 0.959

Page 24: Descriptive measures of the strength of a  linear association

Caution #3

• The coefficient of determination r² and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).

Page 25: Descriptive measures of the strength of a  linear association

Example of Caution #3

[Regression plot: Deaths versus Magnitude. Fitted line: Deaths = -1121.94 + 179.468 Magnitude; S = 140.359, R-Sq = 53.5 %, R-Sq(adj) = 41.9 %]

Pearson correlation of Deaths and Magnitude = 0.732

Page 26: Descriptive measures of the strength of a  linear association

Example of Caution #3

[Regression plot: Deaths versus Magnitude. Fitted line: Deaths = 647.967 - 87.1465 Magnitude; S = 13.1447, R-Sq = 92.1 %, R-Sq(adj) = 89.4 %]

Pearson correlation of Deaths and Magnitude = -0.960
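The effect of a single point can be illustrated with a small sketch. The numbers below are entirely hypothetical (they are not the earthquake data behind the two plots); they are chosen so that one extreme point is enough to flip the sign of r.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of paired data x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (x, y) pairs with a clear decreasing trend.
x = [6.4, 6.6, 6.8, 7.0, 7.2, 7.4]
y = [90.0, 75.0, 60.0, 40.0, 25.0, 10.0]

# The same data plus one hypothetical extreme point.
x_with = x + [7.9]
y_with = y + [480.0]

r_without = pearson_r(x, y)
r_with = pearson_r(x_with, y_with)
print(round(r_without, 3), round(r_with, 3))
```

Without the extra point the correlation is strongly negative; with it, the correlation becomes positive, which is the pattern the two slides show.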

Page 27: Descriptive measures of the strength of a  linear association

Caution #4

• Correlation (association) does not imply causation.

Page 28: Descriptive measures of the strength of a  linear association

Example of Caution #4

[Regression plot: Heart disease deaths (per 100,000 people) versus Wine consumption (liters of wine per person per year). Fitted line: Heart = 260.563 - 22.9688 Wine; S = 37.8786, R-Sq = 71.0 %, R-Sq(adj) = 69.3 %]

Pearson correlation of Wine and Heart = -0.843

Page 29: Descriptive measures of the strength of a  linear association

Caution #5

• Ecological correlations are correlations that are based on rates or averages.

• Ecological correlations tend to overstate the strength of an association.

Page 30: Descriptive measures of the strength of a  linear association

Example of Caution #5

• Data from 1988 Current Population Survey
• Treating individuals as the units
– Correlation between income and education for men age 25-64 in U.S. is r ≈ 0.4.
• Treating nine regions as the units
– Compute average income and average education for men age 25-64 in each of the nine regions.
– Correlation between the average incomes and the average education in U.S. is r ≈ 0.7.
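The mechanism behind this example can be sketched with made-up numbers. In the hypothetical data below (three regions, four individuals each, not the Current Population Survey data), individuals scatter widely around their region's averages, so the individual-level correlation is weak, while the three region averages fall exactly on a line.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of paired data x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical region averages: (education in years, income in $1000s).
region_means = [(10.0, 20.0), (12.0, 24.0), (14.0, 28.0)]
# Hypothetical individual deviations from the region averages,
# arranged so that education and income are unrelated within a region.
deviations = [(-3.0, -5.0), (-3.0, 5.0), (3.0, -5.0), (3.0, 5.0)]

education, income = [], []
for mean_edu, mean_inc in region_means:
    for d_edu, d_inc in deviations:
        education.append(mean_edu + d_edu)
        income.append(mean_inc + d_inc)

r_individuals = pearson_r(education, income)
r_regions = pearson_r([m[0] for m in region_means],
                      [m[1] for m in region_means])
print(round(r_individuals, 2), round(r_regions, 2))
```

Averaging removes the within-region scatter, which is why the correlation between averages is larger than the correlation between individuals.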

Page 31: Descriptive measures of the strength of a  linear association

Example of Caution #5

[Regression plot: Heart disease deaths (per 100,000 people) versus Wine consumption (liters of wine per person per year). Fitted line: Heart = 260.563 - 22.9688 Wine; S = 37.8786, R-Sq = 71.0 %, R-Sq(adj) = 69.3 %]

Page 32: Descriptive measures of the strength of a  linear association

Example of Caution #5

[Regression plot: Mortality versus Latitude (at center of state). Fitted line: Mort = 389.189 - 5.97764 Lat; S = 19.1150, R-Sq = 68.0 %, R-Sq(adj) = 67.3 %]

Page 33: Descriptive measures of the strength of a  linear association

Caution #6

• A “statistically significant” r² does not imply that the slope β1 is meaningfully different from 0.

Page 34: Descriptive measures of the strength of a  linear association

Caution #7

• A large r² does not necessarily mean that a useful prediction of the response y_new (or estimation of the mean response μ_Y) can be made.

• It is still possible to get prediction (or confidence) intervals that are too wide to be useful.

Page 35: Descriptive measures of the strength of a  linear association

Using the sample correlation r to learn about the population correlation ρ

Page 36: Descriptive measures of the strength of a  linear association

Translating a research question into a statistical procedure

• Is there a linear relationship between skin cancer mortality and latitude?
– t-test for testing H0: β1 = 0
– ANOVA F-test for testing H0: β1 = 0
• Is there a linear correlation between husband’s age and wife’s age?
– t-test for testing population correlation coefficient H0: ρ = 0

Page 37: Descriptive measures of the strength of a  linear association

Where does this topic fit in?

• Model formulation

• Model estimation

• Model evaluation

• Model use

Page 38: Descriptive measures of the strength of a  linear association

Is there a linear correlation between husband’s age and wife’s age?

[Scatter plot: Husband's Age (years) versus Wife's Age (years)]

Pearson correlation of HAge and WAge = 0.939

Page 39: Descriptive measures of the strength of a  linear association

Is there a linear correlation between husband’s age and wife’s age?

[Scatter plot: Wife's Age (years) versus Husband's Age (years)]

Pearson correlation of WAge and HAge = 0.939

Page 40: Descriptive measures of the strength of a  linear association

The formal t-test for correlation coefficient ρ

Null hypothesis H0: ρ = 0
Alternative hypothesis HA: ρ ≠ 0 or ρ < 0 or ρ > 0

Test statistic:

$$t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

P-value = What is the probability that we’d get a t* statistic as extreme as we did, if the null hypothesis is true?

The P-value is determined by comparing t* to a t distribution with n-2 degrees of freedom.
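The test statistic is simple enough to compute by hand. The sketch below wraps it in a hypothetical helper, `correlation_t_statistic`, and evaluates it for r = 0.939 and n = 170, the values from the husband and wife age example. The P-value step is left out because it needs a t distribution with n - 2 degrees of freedom, which the Python standard library does not provide.

```python
import math

def correlation_t_statistic(r, n):
    """Return t* = r * sqrt(n - 2) / sqrt(1 - r**2) for testing H0: rho = 0.

    correlation_t_statistic is an illustrative helper name, not a library function.
    """
    if n <= 2:
        raise ValueError("need more than two (x, y) pairs")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

t_star = correlation_t_statistic(0.939, 170)
degrees_of_freedom = 170 - 2
print(round(t_star, 2), degrees_of_freedom)
```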

Page 41: Descriptive measures of the strength of a  linear association

Is there a linear correlation between husband’s age and wife’s age?

Test statistic:

$$t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.939\sqrt{170-2}}{\sqrt{1-0.939^2}} = 35.39$$

Help in determining the P-value:

Student's t distribution with 168 DF

      x   P( X <= x )
35.3900        1.0000

Just let Minitab do the work:

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000

Page 42: Descriptive measures of the strength of a  linear association

When is it okay to use the t-test for testing H0: ρ = 0?

• When it is not obvious which variable is the response.

• When the (x, y) pairs are a random sample from a bivariate normal population.
– For each x, the y’s are normal with equal variances.
– For each y, the x’s are normal with equal variances.
– Either, y can be considered a linear function of x.
– Or, x can be considered a linear function of y.

• The (x, y) pairs are independent.

Page 43: Descriptive measures of the strength of a  linear association

The three tests will always yield similar results.

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000

The regression equation is
HAge = 3.59 + 0.967 WAge

170 cases used 48 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant      3.590    1.159   3.10  0.002
WAge        0.96670  0.02742  35.25  0.000

S = 4.069   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source       DF     SS     MS        F      P
Regression    1  20577  20577  1242.51  0.000
Error       168   2782     17
Total       169  23359

Page 44: Descriptive measures of the strength of a  linear association

The three tests will always yield similar results.

The regression equation is
WAge = 1.57 + 0.911 HAge

170 cases used 48 cases contain missing values

Predictor      Coef  SE Coef      T      P
Constant      1.574    1.150   1.37  0.173
HAge        0.91124  0.02585  35.25  0.000

S = 3.951   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source       DF     SS     MS        F      P
Regression    1  19396  19396  1242.51  0.000
Error       168   2623     16
Total       169  22019

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
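The agreement among the three tests can be checked from the printed output. The sketch below uses only rounded values copied from the Minitab output for HAge regressed on WAge, so the comparisons hold only up to rounding: the squared slope t-statistic is compared with the ANOVA F-statistic, and the correlation t-statistic is rebuilt from the sums of squares.

```python
import math

# Rounded values copied from the Minitab output (HAge regressed on WAge).
t_slope = 35.25        # T for the WAge slope
f_anova = 1242.51      # F from the Analysis of Variance table
ssr, ssto = 20577.0, 23359.0
n = 170                # cases used

r2 = ssr / ssto
r = math.sqrt(r2)      # positive root because the slope 0.96670 is positive
t_corr = r * math.sqrt(n - 2) / math.sqrt(1.0 - r2)

print(round(t_slope ** 2, 2), f_anova)    # squared slope t versus F
print(round(r, 3), round(t_corr, 2))      # r and the correlation t-statistic
```

With these inputs the correlation t-statistic comes out essentially equal to the slope T of 35.25; the 35.39 computed earlier differs slightly only because it starts from r rounded to 0.939.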

Page 45: Descriptive measures of the strength of a  linear association

Which results should I report?

• If one of the variables can be clearly identified as the response, report the t-test or F-test results for testing H0: β1 = 0.
– Does it make sense to use x to predict y?
• If it is not obvious which variable is the response, report the t-test results for testing H0: ρ = 0.
– Does it only make sense to look for an association between x and y?