Outliers and Influential Data Points in Regression Analysis. James P. Stevens. Sujin Jang, November 10, 2008

Outliers and Influential Data Points in Regression Analysis

James P. Stevens

Sujin Jang

November 10, 2008

Beware of Outliers

• Regression is sensitive to outliers
  – Important to detect outliers and influential points

• Summary stats can be misleading…
  – Important to explore the data, rather than relying on just 1-2 summary stats

Look at your Data!

– For all three plots, r, means, and SD are equal
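As a rough illustration of this point, the sketch below computes r, the mean, and the SD for three data sets. The data are an assumption: they are the first three sets of Anscombe's (1973) quartet, used here as a stand-in and not necessarily the data plotted on the slide. The helper name `summarize` is hypothetical.

```python
import numpy as np

# Assumed stand-in data: first three sets of Anscombe's quartet,
# which share the same x values.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
ys = {
    "set 1": np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                       7.24, 4.26, 10.84, 4.82, 5.68]),
    "set 2": np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                       6.13, 3.10, 9.13, 7.26, 4.74]),
    "set 3": np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                       6.08, 5.39, 8.15, 6.42, 5.73]),
}

def summarize(x, y):
    """Return (mean of y, sample SD of y, Pearson r between x and y)."""
    r = np.corrcoef(x, y)[0, 1]
    return float(y.mean()), float(y.std(ddof=1)), float(r)

summaries = {name: summarize(x, y) for name, y in ys.items()}
for name, (mean_y, sd_y, r) in summaries.items():
    print(f"{name}: mean={mean_y:.2f}  SD={sd_y:.2f}  r={r:.2f}")
```

The three sets agree to two decimals on all three statistics, yet a scatterplot of each would look quite different (roughly linear, curved, and linear with one outlying case).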

But it’s not enough to look…

So what should we do?

• Ways of Detecting Outliers:
  – Studentized residuals for outliers on y
  – Mahalanobis distance & hat matrix for outliers in the space of predictors
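A minimal sketch of these three diagnostics, assuming an ordinary least-squares fit with an intercept. The function name `outlier_diagnostics` and the data are hypothetical, and the residuals use the externally studentized (deleted) form, which is one common definition and may differ from the one the slides had in mind.

```python
import numpy as np

def outlier_diagnostics(X, y):
    """Return (studentized residuals, leverages, squared Mahalanobis distances).

    X: (n, k) predictors without an intercept column; y: (n,) outcome.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept
    p = k + 1
    H = A @ np.linalg.pinv(A)              # hat matrix
    h = np.diag(H)                         # leverage of each case
    e = y - H @ y                          # raw residuals
    sse = e @ e
    # Error variance estimated with case i left out of the fit
    s2_deleted = (sse - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_deleted * (1 - h))  # studentized residuals (outliers on y)
    # Squared Mahalanobis distance from the predictor centroid (outliers on x)
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.atleast_2d(np.cov(X, rowvar=False)))
    d2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)
    return t, h, d2

# Hypothetical data: case 6 is unusual on y, case 10 is unusual on x.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0, 7.1, 7.9, 9.2, 20.1])

t, h, d2 = outlier_diagnostics(x, y)
print("largest |studentized residual|: case", int(np.argmax(np.abs(t))) + 1)
print("largest leverage:               case", int(np.argmax(h)) + 1)
```

Leverage and Mahalanobis distance carry the same information here: with an intercept in the model, h_ii = 1/n + D²_i/(n − 1). Cutoffs for flagging cases vary by author, so the sketch only ranks cases.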

Types of Outliers

• Classifying Outliers:
  – Outliers in the space of outcomes (outliers on y)
  – Outliers in the space of predictors (outliers on x)

So what should we do?

• Ways of Detecting Outliers:
  – Studentized residuals for outliers on y
  – Mahalanobis distance & hat matrix for outliers in the space of predictors

BUT… the points these methods identify will not necessarily influence the regression coefficients…

Outliers and Influential Points

[Figure labels: outliers; influential points]

Example: Influential Points

[Figure panels: Non-influential; Influential]

Cook’s Distance: Identifying Influential Points

• A measure of the change in the regression coefficients that would occur if the case were omitted.
  – Affected by the case being an outlier both on y and in the set of predictors
  – Measures the joint (combined) influence of the case being an outlier on y and on x
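A minimal sketch of how Cook's distance combines the two ingredients, using the standard closed form D_i = (e_i² / (p·s²)) · h_i / (1 − h_i)², where e_i is the residual, h_i the leverage, p the number of coefficients, and s² the residual variance. The function name `cooks_distance` and the data are hypothetical; the example contrasts a high-leverage case that follows the trend with one that does not.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for each case in a simple regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(x)), x])  # intercept + predictor
    n, p = A.shape
    H = A @ np.linalg.pinv(A)                  # hat matrix
    h = np.diag(H)                             # leverage: outlier on x
    e = y - H @ y                              # residual: outlier on y
    s2 = e @ e / (n - p)                       # residual variance
    return (e**2 / (p * s2)) * h / (1 - h) ** 2

# Hypothetical data: nine cases near y = x, plus a tenth case far out on x.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y_base = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.2]
y_on = np.array(y_base + [20.1])    # tenth case follows the trend
y_off = np.array(y_base + [10.0])   # tenth case departs from the trend

d_on = cooks_distance(x, y_on)
d_off = cooks_distance(x, y_off)
print("on-trend  case 10: D =", round(float(d_on[-1]), 3))
print("off-trend case 10: D =", round(float(d_off[-1]), 3))
```

The tenth case has the same leverage in both versions, but only the off-trend version gets a large D: being far out on x is not enough to move the coefficients unless the case also disagrees with the fit on y.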

Now what?

Step 1. Detect
Step 2. Isolate
Step 3. Examine
  - Are they qualitatively different?
  - Are they influential?

Another thing to consider: influential “clusters”?

Example: Groups of Cases
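One way to probe a suspected cluster is to delete the cases as a group and compare the coefficients, since single-case deletion can understate their influence when the cases back each other up. The sketch below assumes a simple regression; the helper names `coefs` and `change_when_deleted` and the data are hypothetical.

```python
import numpy as np

def coefs(x, y):
    """Least-squares (intercept, slope) for a simple regression."""
    A = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def change_when_deleted(x, y, cases):
    """Coefficients without the given cases minus coefficients with all cases."""
    keep = np.ones(len(x), dtype=bool)
    keep[list(cases)] = False
    return coefs(x[keep], y[keep]) - coefs(x, y)

# Hypothetical data: nine cases near y = x, plus two similar off-trend cases.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21], dtype=float)
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.2, 10.0, 10.5])

for cases in ([9], [10], [9, 10]):
    d_intercept, d_slope = change_when_deleted(x, y, cases)
    print(f"delete cases {cases}: slope changes by {d_slope:+.3f}")
```

Deleting either off-trend case alone barely moves the slope because the other one still pulls the line; deleting both together changes it by roughly ten times as much.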

Now what?

Step 1. Detect
Step 2. Isolate
Step 3. Examine
  - Are they qualitatively different?
  - Are they influential?
Step 4. Delete or retain as you see fit … or try both
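The "try both" option can be sketched as fitting the model twice and reporting the two sets of coefficients side by side. The function name `fit_both` and the data are hypothetical, and the flagging rule (Cook's distance above 1) is a commonly cited rule of thumb assumed here, not something the slides specify.

```python
import numpy as np

def fit_both(x, y, cutoff=1.0):
    """Fit with all cases, then again without cases whose Cook's D exceeds cutoff."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(x)), x])
    n, p = A.shape
    b_all = np.linalg.lstsq(A, y, rcond=None)[0]
    e = y - A @ b_all
    h = np.einsum("ij,ji->i", A, np.linalg.pinv(A))  # diagonal of the hat matrix
    s2 = e @ e / (n - p)
    cooks_d = (e**2 / (p * s2)) * h / (1 - h) ** 2
    keep = cooks_d <= cutoff                         # assumed rule of thumb
    b_trimmed = np.linalg.lstsq(A[keep], y[keep], rcond=None)[0]
    return {
        "flagged": np.flatnonzero(~keep).tolist(),   # 0-based case indices
        "all_cases": b_all,
        "without_flagged": b_trimmed,
    }

# Hypothetical data: nine cases near y = x, plus one off-trend case far out on x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.2, 10.0]

result = fit_both(x, y)
print("flagged cases (0-based):", result["flagged"])
print("slope, all cases:       ", round(float(result["all_cases"][1]), 3))
print("slope, without flagged: ", round(float(result["without_flagged"][1]), 3))
```

Reporting both fits lets a reader see how much the conclusion depends on the flagged cases, whichever version is treated as primary.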

The End