Copyright (c)Bani K. Mallick 4
Lecture 20 Review: Leverage and Outliers
Outliers in Linear Regression are difficult to diagnose
They depend crucially on where X is
* **
*
*
A boxplot of Y would think this is an outlier, when in reality it fits the line quite well
Copyright (c)Bani K. Mallick 5
Lecture 20 Review: Outliers and Leverage
It’s also the case than one observation can have a dramatic impact on the fit
**
**
*
*
*
The slope of the line depends crucially on the value far to the right
Copyright (c)Bani K. Mallick 6
Lecture 20 Review: Outliers and Leverage
But Outliers can occur
*
**
**
*
*
This point is simply too high for its value of XLine with Outlier
Line without Outlier
Copyright (c)Bani K. Mallick 7
Lecture 20 Review: Outliers and Leverage
A leverage point is an observation with a value of X that is outlying among the X values
An outlier is an observation of Y that seems not to agree with the main trend of the data
Outliers and leverage values can distort the fitted least squares line
It is thus important to have diagnostics to detect when disaster might strike
Copyright (c)Bani K. Mallick 8
Lecture 20 Review: Outliers and Leverage
We have three methods for diagnosing high leverage values and outliers
Leverage plots: For a single X, these are basically the same as boxplots of the X-space (leverage)
Cook’s distance (measures how much the fitted line changes if the observation is deleted)
Residual Plots
Copyright (c)Bani K. Mallick 9
Correlation and Measures of Fit
You all know the word “correlation”, as in “Height and Weight are positively correlated”
Many of you may also have heard of R-squared denoted by R2
Both are measures of how well an independent variable predicts a dependent variable
Copyright (c)Bani K. Mallick 10
Correlation and Measures of Fit
R2 measures the fraction of variance explained by the least squares line
The relevant sums of squares are
The fraction of the total sum of squares explained by the fitted line is
n 2
ii 1
Sum of Squares Total SST Y Y
n 2
i ii 1
ˆSum of Squares Residual SSR Y Y
2 SST SSRR
SST
Copyright (c)Bani K. Mallick 11
Correlation and Measures of Fit
R2 measures the fraction of variance explained by the least squares line
If Y and X are perfectly linearly related, then all the variation in Y is explained by the line, and thus R2 = 1
If Y and X are completely independent, then the line explains nothing about Y, so R2 = 0
However, Y and X can be perfectly related but not linearly, and R2 is misleading in this case (see later on)
Copyright (c)Bani K. Mallick 12
GPA and Height
Height in inches
807570656055
Gra
de
Po
int
Ave
rag
e (
GP
A)
4.5
4.0
3.5
3.0
2.5
2.0
1.5
Note that this is a fairly weak relationship, so little variance explained: suggests R-squared is near zero
Copyright (c)Bani K. Mallick 13
GPA and Height
Model Summary: GPA regression on Height
.273a .074 .065 .4979Model1
R R SquareAdjustedR Square
Std. Error ofthe Estimate
Predictors: (Constant), Height in inchesa.
ANOVAb
1.954 1 1.954 7.881 .006a
24.294 98 .248
26.247 99
Regression
Residual
Total
Model1
Sum ofSquares df Mean Square F Sig.
Predictors: (Constant), Height in inchesa.
Dependent Variable: Grade Point Average (GPA)b.
Copyright (c)Bani K. Mallick 14
Aortic Valve Area and Body Surface Area
Note that this is a stronger relationship: suggests R-squared is higher
Healthy Children, Stenosis Data
Body Surface Area
2.52.01.51.0.50.0
Ao
rtic
Va
lve
Are
a
3.5
3.0
2.5
2.0
1.5
1.0
.5
0.0
-.5
Copyright (c)Bani K. Mallick 15
AVA and BSA in Healthy Kids
Model Summary: AVA Regression on BSA in Healthy Kids
.866a .750 .747 .39337Model1
R R SquareAdjustedR Square
Std. Error ofthe Estimate
Predictors: (Constant), Body Surface Areaa.
ANOVAb
31.640 1 31.640 204.474 .000a
10.522 68 .155
42.162 69
Regression
Residual
Total
Model1
Sum ofSquares df Mean Square F Sig.
Predictors: (Constant), Body Surface Areaa.
Dependent Variable: Aortic Valve Areab.
Copyright (c)Bani K. Mallick 16
Correlation and Measures of Fit
The (Pearson) correlation coefficient measures how well Y and X are linearly related
The correlation is always between –1 and +1
2xy
2xy
xy
r R ifl east squares slope 0
r R ifl east squares slope 0
r 0 ifl east squares slope 0
Copyright (c)Bani K. Mallick 17
Correlation and Measures of Fit
If the correlation = +1, then Y and X are perfectly positively related
If the correlation = -1, then Y and X are perfectly negatively related
If the correlation = 0, then Y and X are not linearly related
2xy
2xy
xy
r R ifl east squares slope 0
r R ifl east squares slope 0
r 0 ifl east squares slope 0
Copyright (c)Bani K. Mallick 18
Correlation and Measures of Fit
The (Spearman) correlation coefficient measures how well Y and X are monotonically related
Replace Y by its rank among the Y’s
Replace X by its rank among the X’s
Computer the (Pearson) correlation
Why would someone do a Spearman correlation?
Copyright (c)Bani K. Mallick 19
Correlation and Measures of Fit
The (Spearman) correlation coefficient measures how well Y and X are monotonically related
Replace Y by its rank among the Y’s
Replace X by its rank among the X’s
Computer the (Pearson) correlation
Why would someone do a Spearman correlation? Because it is more robust to outliers, and it is not affected by transformations
Copyright (c)Bani K. Mallick 20
Correlation and Measures of Fit
Both types of correlations are easily obtained in SPSS
Go to “Analyze”, “Correlation” and type in all the variables that you want correlations for
You have to click on Spearman to get it, otherwise you get only Pearson
Confidence intervals for the population correlations are not included
SPSS Demonstration using aortic data
Copyright (c)Bani K. Mallick 21
Correlation and Measures of Fit
The diagonals are meaningless: Y is perfectly correlated with Y
Correlations
1.000 .866 ** .873 **
. .000 .000
70 70 70
.866 ** 1.000 .982 **
.000 . .000
70 70 70
.873 ** .982 ** 1.000
.000 .000 .
70 70 70
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Body Surface Area
Aortic Valve Area
Log(1+Aortic Valve Area)
Body SurfaceArea
AorticValve Area
Log(1+AorticValve Area)
Correlation is significant at the 0.01 level (2-tailed).**.
Copyright (c)Bani K. Mallick 22
Correlation and Measures of Fit
Note how the Spearman correlation of BSA and AVA is the same as the Spearman correlation of bSA and log(1+AVA)
Correlations
1.000 .883** .883**
. .000 .000
70 70 70
.883** 1.000 1.000**
.000 . .
70 70 70
.883** 1.000** 1.000
.000 . .
70 70 70
Correlation Coefficient
Sig. (2-tailed)
N
Correlation Coefficient
Sig. (2-tailed)
N
Correlation Coefficient
Sig. (2-tailed)
N
Body Surface Area
Aortic Valve Area
Log(1+Aortic Valve Area)
Spearman's rho
Body SurfaceArea
AorticValve Area
Log(1+AorticValve Area)
Correlation is significant at the .01 level (2-tailed).**.
Copyright (c)Bani K. Mallick 23
Correlation and Measures of Fit
Both correlations are random variables, i.e., if you redo an experiment, you will get a different Pearson correlation
The population Pearson correlation is
The estimated standard error of the sample Pearson correlation is
2xy
xy
1 rs.e.(r )
n 2
xy
Copyright (c)Bani K. Mallick 24
Correlation and Measures of Fit
Null hypothesis of no linear relationship:
A (1100% CI for the population Pearson correlation is
Since the population correlation must be between –1 and +1, you should restrict your interval to that range: reject null if interval does not include 0
2xy
xy / 2
1 rr z
n 2
0 xyH : 0
Copyright (c)Bani K. Mallick 25
Correlation and Measures of Fit
Consider the aortic stenosis healthy kids
n = 70, Pearson correlation = 0.866
The 95% CI is
What is the meaning of this interval?
2xy
xy 1 / 2
1 rr z
n 2
21 0.8660.866 1.96
70 20.866 0.119
.747 to .985
Copyright (c)Bani K. Mallick 26
Correlation and Measures of Fit
Consider the aortic stenosis healthy kids
n = 70, Pearson correlation = 0.866
The 95% CI is
What is the meaning of this interval? 95% certain that the population Pearson correlation is between .747 and .985
2xy
xy 1 / 2
1 rr z
n 2
21 0.8660.866 1.96
70 20.866 0.119
.747 to .985
Copyright (c)Bani K. Mallick 27
Some Warnings About Correlation
The Pearson correlation can be greatly affected by outliers and leverage values
This is why it is good to have the Spearman
Copyright (c)Bani K. Mallick 28
Aortic Stenosis Data: Note the outlier in the Stenotic Kids
Linear Regression
0.000 0.500 1.000 1.500 2.000
Body Surface Area
0.000
1.000
2.000
3.000
Ao
rtic
Val
ve A
rea
Healthy Stenoti
0.000 0.500 1.000 1.500 2.000
Body Surface Area
Aortic Stenosis Data
Copyright (c)Bani K. Mallick 29
Some Warnings About Correlation
The Pearson correlation with the outlier in the Stenotic kids is 0.477
It is 0.648 without the outlier
The Spearman correlations are 0.691 and 0.762 with and without the outlier
I can make correlations dance
Copyright (c)Bani K. Mallick 30
Some Warnings About Correlation
The correlations are 0.058 (left) and 1.00 (right)! Only one point differs: high leverage outlier
Linear Regression
-10.00 -7.50 -5.00 -2.50 0.00
x
-10.00
-5.00
0.00
5.00
10.00
Ou
tlie
r ad
ded
1.00 2.00
-10.00 -7.50 -5.00 -2.50 0.00
x
Made Up Data with (left) and without (right) a high leverage outlier
Copyright (c)Bani K. Mallick 31
Some Warnings About Correlation
The Pearson correlation only measures linear correlation
If your relationship is not linear, then Pearson will get confused
Copyright (c)Bani K. Mallick 32
Some Warnings About Correlation
Note the perfect quadratic relationship
Pearson corr = 0
X
20100-10-20
Y q
ua
dra
tic in
X
120
100
80
60
40
20
0
-20
Copyright (c)Bani K. Mallick 33
Construction Data
20.00 30.00 40.00 50.00
Age (Modified)
8.00
9.00
10.00
11.00
Copyright (c)Bani K. Mallick 34
Construction Data
Correlations
1.000 .178** .120*
. .000 .011
447 447 447
.178** 1.000 .896**
.000 . .000
447 447 447
.120* .896** 1.000
.011 .000 .
447 447 447
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Age (Modified)
Base Pay (modified)
Log(Base Paymodified - $30,000)
Age (Modified)Base Pay
(modified)
Log(BasePay modified
- $30,000)
Correlation is significant at the 0.01 level (2-tailed).**.
Correlation is significant at the 0.05 level (2-tailed).*.
Copyright (c)Bani K. Mallick 35
Construction Data
Correlations
1.000 .151** .151**
. .001 .001
447 447 447
.151** 1.000 1.000**
.001 . .
447 447 447
.151** 1.000** 1.000
.001 . .
447 447 447
Correlation Coefficient
Sig. (2-tailed)
N
Correlation Coefficient
Sig. (2-tailed)
N
Correlation Coefficient
Sig. (2-tailed)
N
Age (Modified)
Base Pay (modified)
Log(Base Paymodified - $30,000)
Spearman's rhoAge (Modified)
Base Pay(modified)
Log(BasePay modified
- $30,000)
Correlation is significant at the .01 level (2-tailed).**.
Copyright (c)Bani K. Mallick 36
Armspan Data (Males)
67.50 70.00 72.50 75.00 77.50
Height (modified)
65.00
70.00
75.00
80.00
Armspan (modified) = -9.16 + 1.13 * modhR-Square = 0.73
Height and Armspan for Males
Copyright (c)Bani K. Mallick 37
Armspan Data (Males)
Correlations
1.000 .855**
. .000
50 50
.855** 1.000
.000 .
50 50
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Height (modified)
Armspan (modified)
Height(modified)
Armspan(modified)
Correlation is significant at the 0.01 level (2-tailed).**.
Copyright (c)Bani K. Mallick 38
Armspan Data (Males)
A 95% confidence interval for the population Pearson correlation is
Meaning?
2xy
xy
2
1 rr 1.96
n 2
1 0.8550.855 1.96
50 2
0.855 1.96 0.0056
0.855 0.146
0.709,1.00
Copyright (c)Bani K. Mallick 39
Armspan Data (Males)
Correlations
1.000 .813**
. .000
50 50
.813** 1.000
.000 .
50 50
Correlation Coefficient
Sig. (2-tailed)
N
Correlation Coefficient
Sig. (2-tailed)
N
Height (modified)
Armspan (modified)
Spearman's rho
Height(modified)
Armspan(modified)
Correlation is significant at the .01 level (2-tailed).**.