Topic 5 – Partial Correlations; Diagnostics & Remedial Measures (Chapters 10 & 14)




Overview

Review: MLR Tests & Extra SS

Partial Correlations – Think “Extra” SS being used to compute “Extra” R2

Of the variation left to explain...How much is explained by adding another group of variables?

Diagnostics & Remedial Measures

We’ve already talked a lot about this. So...some review, with a few additional points to be made.


Review (Tests)

1. ANOVA F Test: Does the group of predictor variables explain a significant percentage of the variation in the response?

2. Variable Added Last T-tests: Does a given variable explain a significant part of the variation remaining after all other variables have been included in the model?

3. Partial F Tests: Does a group of variables explain significant variation in the response over and above that already explained by other variables already in the model?


Review (SAS)

Type I SS are sequential sums of squares:

  SS(X1)
  SS(X2 | X1)
  SS(X3 | X1, X2)
  SS(X4 | X1, X2, X3)
  ...

Type II (or III) SS are extra sums of squares for variables added last. For X2:

  SS(X2 | X1, X3, X4), etc.


Review (ESS)

Type I SS are additive and sum to SSR. If the variables are added in the proper order, any ESS may be computed using the Type I sums of squares.

Type II SS are not additive and can only be used to assess the contribution of an individual variable over and above the rest of the predictors in the model.

In a lot of ways, the Type II SS are giving the same information that variable added last t-tests gave.

The presentation is in SS instead of sig. tests.


Review

[Diagram: Total SS partitioned sequentially into SS(X1), SS(X2|X1), and SS(X3|X1,X2)]


Review (ESS)

Extra Sums of Squares is the SS due to an added group of variables over and above any variables (given) already in the model.

For Example:

SS(X2 | X1) = SS(X1, X2) - SS(X1)

SS(X2, X3 | X1) = SS(X1, X2, X3) - SS(X1)


Review (General Linear Test)

The test is performed by comparing variances between the full and reduced models.

F-statistic is based on SS:

F = [ (SSE(reduced) - SSE(full)) / k ] / MSE(full)

The numerator difference, SSE(reduced) - SSE(full), is the EXTRA SS for the added variables; k is the number of added parameters.
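As a quick numeric sketch of this computation (the numbers here are hypothetical: a reduced model with SSE = 2305, a full model with SSE = 1536 on 28 error df, and k = 1 added term):

```python
# General linear test: F = [(SSE(reduced) - SSE(full)) / k] / MSE(full)
sse_reduced = 2305.0   # hypothetical SSE of the reduced model
sse_full = 1536.0      # hypothetical SSE of the full model
k = 1                  # number of added parameters
df_full = 28           # hypothetical error df of the full model

extra_ss = sse_reduced - sse_full     # the EXTRA SS for the added variable(s)
mse_full = sse_full / df_full
f_stat = (extra_ss / k) / mse_full
print(round(f_stat, 2))               # 14.02
```

The resulting F would then be compared with an F critical value on (k, df_full) degrees of freedom.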


Review: Hypotheses

Test comparing two models (null model a subset of the full model):

H0: Y = β0 + β1X1 + ε
Ha: Y = β0 + β1X1 + β2X2 + β3X3 + ε

The above is the same as H0: β2 = β3 = 0.

Rejecting means at least one variable in the “added group” is important.


Correlations: Multiple, Partial

Using Multiple & Partial Correlations in Multiple Regression Analysis (Chapter 10)


Simple Correlation

In SLR, the correlation coefficient r_YX describes the “strength” and “direction” of the linear relationship.

Problem: in MLR, simple correlations are mainly used to identify collinearity or multicollinearity. PROC CORR provides a matrix of simple correlations among all variables in a data set.

Single correlations > 0.9 suggest collinearity; several > 0.5 suggest multicollinearity.


Review (R2)

R2 is much more useful in MLR

Called the coefficient of determination

Computationally:

Conceptually: describes the percent of variation in the response that is explained by the predictor variables.

R² = SSR / SST = 1 - SSE / SST
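A one-line check using the SBP numbers that appear later in these slides (SST = 6425, SSE = 1536):

```python
# Coefficient of determination from sums of squares (SBP example values)
sst = 6425.0
sse = 1536.0
r2 = 1 - sse / sst        # equivalently SSR / SST, with SSR = SST - SSE
print(round(r2, 3))       # 0.761
```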


Other Types of Correlation

Multiple Correlation: assess the strength of the relationship between Y and a set of predictors.

Partial Correlation: assess the relationship between Y and a single X after first putting into the model an initial set of Z’s (i.e. old X’s).

Multiple Partial Correlation: assess the relationship between Y and an additional set of predictors (X’s) after adjusting for initial predictors (Z’s).


Multiple Correlations

My notation deviates from text – but is easier to understand:

R²_{Y|X1,X2,...,Xk} is interpreted as the percentage of variation in Y (as represented by SST) that is explained by X1, X2, ..., Xk.

This number is called the multiple coefficient of determination for the model. Values close to 1 imply greater “strength” in terms of the linear association.


Formula: MCD / MCC

Recall: the fit is better if the Ŷ's do a better job than Ȳ in terms of prediction and estimation. SSR compared to SST measures “how much better?”

R²_{Y|X1,X2,...,Xk} = SSR / SST and R_{Y|X1,X2,...,Xk} = √(R²_{Y|X1,X2,...,Xk})

Note: More complex formulas appear in the text (p. 163); we would not use these in practice. Additionally, we seldom refer to R, usually using R².


Uses of R²_{Y|X1,X2,...,Xk}

The Coefficient of Determination can be used to compare different fits.

General idea

1. Higher R2 is better. But R2 always increases with an additional predictor.

2. Fewer predictors is better, so if R2 doesn’t go up by “much” it is usually better to leave the predictor out.

Combining the above two ideas, the Adjusted R² is a reasonable compromise (which we will discuss next week).


Partial Correlations

Notation: r²_{YX|Z1,Z2,...,Zk}

Think: Z1, Z2, ..., Zk are the “original predictors already in model” and now we want to add X to see what happens. SSE(Z1, Z2, ..., Zk) is the part of SST which is left to explain after the Z's have been used.

r²_{YX|Z1,Z2,...,Zk} = SS(X | Z1, Z2, ..., Zk) / SSE(Z1, Z2, ..., Zk)


Partial Correlations (2)

Conceptually: The partial coefficient of determination represents the percentage of “remaining variation” in the response (after the initial group of predictors has been incorporated into the model) that is explained by the “added” predictor.

Numerous formulas are presented in Section 10.5 of the text; however a good conceptual understanding and the ability to use the sums of squares for computation is all that is necessary in this class.


Multiple Partial Correlations

This is simply an extension of partial correlation; now instead of adding a single X we add multiple predictors X1, X2, ..., Xp.

Conceptually: The multiple partial coefficient of determination represents the percentage of “remaining variation” in the response (after the initial group of predictors has been used) that is explained by the “added” group of predictors.

R²_{Y(X1,X2,...,Xp)|Z1,Z2,...,Zk} = SS(X1, X2, ..., Xp | Z1, Z2, ..., Zk) / SSE(Z1, Z2, ..., Zk)


Spurious Correlation

A correlation between two variables X and Y is called spurious if the two variables are correlated (indirectly) only because of their (direct) correlation to a lurking variable or group of variables. E.g., corn yield is perhaps correlated with the water level in the city reservoir; this is only true because both are related directly to the current weather pattern.

See Problem 10.3 in the text.

Correlation does not imply causation!


Obtaining values from SAS

In the model statement of proc reg, there are two options: ‘pcorr1’ and ‘pcorr2’.

pcorr1 gives (in order of entry of the predictor variables): r²_{YX1}, r²_{YX2|X1}, r²_{YX3|X1,X2}, etc.

pcorr2 gives squared partial correlations for the variable added last, e.g. r²_{YX3|X1,X2,X4,X5}.


Example: SBP

proc reg;
  model sbp = size age smk / ss1 ss2 pcorr1 pcorr2;
  model sbp = smk size age / ss1 ss2 pcorr1 pcorr2;

                                             Squared       Squared
                                             Partial       Partial
Variable   DF   Type I SS    Type II SS      Corr Type I   Corr Type II
Intercept   1   668457       963.09739       .             .
Size        1   3537.94574   200.14147       0.55057       0.11527
Age         1   582.64651    769.45920       0.20175       0.33373
Smk         1   769.23345    769.23345       0.33367       0.33367

Variable   DF   Type I SS    Type II SS      Corr Type I   Corr Type II
Intercept   1   668457       963.09739       .             .
Smk         1   393.09816    769.23345       0.06117       0.33367
Size        1   3727.26833   200.14147       0.61783       0.11527
Age         1   769.45920    769.45920       0.33373       0.33373
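The two Type I SS columns above illustrate additivity: whatever the entry order, the sequential sums of squares total the same SSR. A quick check in Python using the printed values:

```python
# Type I (sequential) SS from the two model statements above
order_size_age_smk = [3537.94574, 582.64651, 769.23345]
order_smk_size_age = [393.09816, 3727.26833, 769.45920]

ssr_1 = sum(order_size_age_smk)
ssr_2 = sum(order_smk_size_age)
print(round(ssr_1, 3), round(ssr_2, 3))   # both sums equal SSR (about 4889.826)
```

The Type II SS, by contrast, do not sum to anything meaningful; each one stands alone as a variable-added-last quantity.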


Sample Computations

SST = 6425; SSE = 1536 (from SAS)

r²_{Y age|size} = SS(age | size) / (SST - SS(size)) = 583 / (6425 - 3538) ≈ 0.202

r²_{Y smk|age,size} = SS(smk | age, size) / SSE(age, size) = 769 / (1536 + 769) ≈ 0.334
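These computations are just ratios of the sums of squares already printed by SAS; in Python:

```python
# Partial coefficients of determination from the SBP sums of squares
sst = 6425.0               # total SS
sse_full = 1536.0          # SSE with size, age, smk all in the model
ss_size = 3538.0           # SS(size)  (Type I, rounded as on the slide)
ss_age_given_size = 583.0  # SS(age | size)
ss_smk_given_rest = 769.0  # SS(smk | age, size)

# Each partial r^2 is the fraction of the *remaining* variation
# that the added term explains
r2_age = ss_age_given_size / (sst - ss_size)
r2_smk = ss_smk_given_rest / (sse_full + ss_smk_given_rest)
print(round(r2_age, 3), round(r2_smk, 3))   # 0.202 0.334
```

These match the squared partial correlations (0.20175 and 0.33367) printed by the pcorr1/pcorr2 options, up to rounding of the SS.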


CLG Activity

In CLG’s, please attempt Activity #1 from the handout for Topic 5. Note that this activity builds on the one we used for Topic #4.


Regression Diagnostics & Remedial Measures

Diagnosing and Fixing Problems (Chapter 14)


Overview

Assumptions: What problems can occur?

Diagnostics: How do we find the problems if they exist?


Potential Problems (1)

Errors in data entry (these would initially show up as outliers)

Very simple issue, you might be surprised at how often it occurs.

Examples

10.07 gets mistyped as 1007

Some data recorded in centimeters, some recorded in meters.


GREEN = correct data; RED = data with 1.9 mis-recorded as 19


Potential Problems (2)

Data not from the same population

It can occur that experimental units are so vastly different that we cannot “account” for those differences using predictor variables.

Example

Huge cities (Chicago, Indy) may not exhibit the same behavior as smaller cities (Lafayette, Ft. Wayne) even after we “adjust” for population size.


The point at X = 50 does not fit the same pattern; it is from a different (unusual) population.


Potential Problems (3)

Violations of basic model assumptions

Assumptions on errors

Assumed (chosen) parametric form

Multicollinearity


Review of Assumptions

1. Basic Assumptions on ERRORS

a. Independent errors (and observations)

b. Normally distributed errors

c. Constant error variance

2. Model validity – the model is not missing any important terms or predictors


Violations of Assumptions—Examples

If the response variable is a “count”

Errors may not be normal.

Results are more widely “varied” as age increases

Errors have non-constant variance

Survey data is obtained using a person to ask the questions. This person gets better at the task with experience.

Data may not be independent.


Regular Residuals

Assessments of normality, constancy of variance, and linearity are all based on analysis of the residuals (usually just looking at appropriate plots).

Observed - predicted = residual:  Y_i - Ŷ_i = e_i

The average residual is ē = 0. The sum of squared residuals is Σ e_i² = SSE.


Diagnostic Plots

We generally know what these plots will look like. So this section will just present some pictures depicting violations of assumptions.


Residual Plot

Plot residuals vs. predicted values

Curvilinear pattern indicates need for additional terms in the model (quadratic, log).

Increasing or decreasing spread (megaphone) pattern indicates non-constant variance.

If problems are indicated, you can plot residuals vs. specific predictors to find the predictor at the root of the problem.


Other plots using residuals

Plot residuals vs. normal quantiles

UNIVARIATE procedure (QQPLOT statement)

If plot is linear, then normality is probably satisfied. If close to linear, then results probably still valid (robustness).

Plot Residuals vs. order of observation

Pattern suggests that observations are not independent (changing over time)


Review of Diagnostic Plots

Normality

QQ Plot of residuals against normal quantiles should show linearity

Constancy of Variance

Residual Plot (residuals vs. predicted)—look for pattern in the vertical spread – see page 225

Linearity

Residual Plot (residuals vs. predicted)—any “shape” pattern, like a curve, may indicate that the terms of the model should be changed.


Diagnostic Plots (2)

Independence

Can plot residuals over time or observation number; the plot should show NO patterns. Better to make sure data collection doesn't violate independence (e.g. use an SRS).

Outliers

These will show up in plots as well! Usually quite obvious.


Residual Plot—Residuals vs Predicted (good)


QQplot—Normality (good)


Residual Plot— Violation: Linearity


Residual Plot— Violation: Constant Variance


QQplot—Problem: Outlier


Residual Plot— Problem: Outlier


QQplot—Problem: Discrete Errors (count data)


QQplot—Problem: Heavy Tails (not normal)

[QQ plot of residuals vs. normal quantiles, showing the heavy-tailed pattern]


Further Residual Analysis

As we’ve seen, our assumptions may usually be assessed by looking at residuals.

There are actually two types of residuals! Regular residuals can be used in standard residual plots to check the three basic assumptions on errors.

Studentized residuals (residuals scaled relative to their standard deviations) can be used to make stronger assessments, such as whether a particular data point should be classified as an outlier.


Regular residuals e_i are random variables (if we take new data, we'll get new and different residuals).

Residuals all have different standard deviations; SAS will calculate these.

Furthermore, they are NOT independent.

From a technical standpoint these issues make it difficult to apply statistical methods. However, for sample sizes (n) large relative to the number of variables (k), this technicality can in large part be ignored. This is why standard residual plots are useful.


Studentized Residuals

The studentized residual is defined as

r_i = e_i / √( MSE (1 - h_i) )

Without going into mathematical detail, the quantity h_i (0 ≤ h_i ≤ 1) is known as the leverage of observation i. It is a measure of the importance of observation i in determining the model fit.

The studentized residuals approximate a t-distribution with n - k - 1 (error) degrees of freedom if the underlying assumptions are true.
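In the simple-regression case (k = 1) the leverage has the closed form h_i = 1/n + (x_i - x̄)²/Sxx, so studentized residuals can be sketched in pure Python; the data below are made up for illustration:

```python
# Studentized residuals for SLR: r_i = e_i / sqrt(MSE * (1 - h_i))
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 9.9]          # hypothetical data
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(e * e for e in resid) / (n - 2)    # error df = n - k - 1 = n - 2

# Leverage in SLR: h_i = 1/n + (x_i - xbar)^2 / Sxx  (always between 0 and 1)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
student = [e / (mse * (1 - hi)) ** 0.5 for e, hi in zip(resid, h)]
print([round(r, 2) for r in student])
```

A useful sanity check on any such fit: the raw residuals sum to zero, and the leverages sum to the number of model parameters (here 2).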


Different Types of Residuals

Book also discusses studentized deleted residuals (observation is not used in the computation of its studentized residual).

Can make all the residual plots using any kind of residuals!

Note: All types of residuals look similar when assumptions are met, hence plots can be read the same way

Only the regular residuals are subject to a change in units, so when problems occur they may be more obvious if we plot the studentized (or deleted) residuals.


Influence Statistics

In addition to plots, as we mentioned in the SLR topic there are also some statistical tests which can be used to check for problems.

We will discuss a few tests that are good for “classifying” points as outliers.


Identifying Outliers in Y

The studentized residuals have a t-distribution with DF equal to the error DF

“Student Residual” in SAS

We can use t-tests to find abnormal values, but we actually need to test n residuals, so a Bonferroni adjustment is needed.

Any studentized residual with magnitude greater than the CV is classified as an outlier in Y. (See Table A-8B)


Identifying Outliers in X

High leverage values (h_i) indicate that a data point has a lot of influence.

The reason for the influence is usually that the data point is near the “edge” of the model’s scope and “pulls” the line

“Hat Diag H” in SAS

A leverage that is larger than the CV indicates that a point has a lot of general influence on the regression (See Table A-9)


Influence on all the slopes

Cook’s Distance measures the extent to which the regression coefficients (slopes) change when the ith observation is deleted.

Outliers in X may also have high Cook’s D.

“Cook's D” in SAS

A Cook's distance that is larger than the critical value indicates that the observation has unnatural influence on the regression coefficients (see Table A-10; divide the table value by n-k-1).
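One common computational form combines the studentized residual and the leverage: D_i = (r_i²/p) · h_i/(1 - h_i), with p = k + 1 model parameters. A pure-Python SLR sketch on made-up data (the last point is deliberately extreme in x):

```python
# Cook's distance: D_i = (r_i^2 / p) * h_i / (1 - h_i)
x = [1.0, 2.0, 3.0, 4.0, 10.0]   # hypothetical data; x = 10 is an outlier in X
y = [2.0, 4.1, 5.9, 8.2, 19.0]
n, p = len(x), 2                  # p = k + 1 = 2 parameters (intercept, slope)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(e * e for e in resid) / (n - p)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]           # leverages

cooks = [(e * e / (mse * (1 - hi))) / p * hi / (1 - hi)    # r_i^2/p * h/(1-h)
         for e, hi in zip(resid, h)]
print(h.index(max(h)))    # 4: the x = 10 point has the highest leverage
```

This makes the slide's point concrete: a point far out in X gets a large h_i, which inflates its Cook's D even when its residual is modest.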


Influence on individual slopes

DFBetas for a given predictor variable measure the influence of an observation on that predictor's parameter estimate.

A high DFBeta occurs when the value of the predictor is outlying for a given observation.

“DFBETAS” in SAS

Any |DFBeta| > 1 (general cutoff) indicates that the observation has unnatural influence on THAT particular slope


Other Measures (2)

DFFits measure the influence of an observation on its own fitted (predicted) value

This is less critical – but again outliers in the predictors have larger values

“DFFITS” in SAS

Any |DFFit| > 1 (general cutoff) indicates that the observation is predicting itself too much!


Influence statistics in SAS

Use the “r” and “influence” options in the model statement of PROC REG.

Also gives:

PRESS (predicted residual SS) is the sum of squared deleted residuals. A well-fit model will have PRESS close to the value of SSE.

If the SSE and PRESS are fairly different, there are likely influential values somewhere and we need to hunt them down!
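Because the deleted residual has the closed form e_i/(1 - h_i), PRESS is easy to sketch for SLR in pure Python (hypothetical data):

```python
# PRESS = sum of squared deleted residuals, where d_i = e_i / (1 - h_i)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 4.8, 6.1, 8.2, 9.7]   # hypothetical data
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

sse = sum(e * e for e in resid)
press = sum((e / (1 - hi)) ** 2 for e, hi in zip(resid, h))
print(press >= sse)   # True: PRESS can never be smaller than SSE
```

Since 0 < 1 - h_i ≤ 1, each deleted residual is at least as large in magnitude as the ordinary one, which is why PRESS ≥ SSE and why a big gap flags influential points.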


SBP (output using /r)

      Dependent   Predicted   Std Error                  Std Error
Obs   Variable    Value       Mean Predict    Residual   Residual
  8   160.0000    144.2950    2.8509          15.7050    6.836
  9   144.0000    128.7551    3.4820          15.2449    6.537
 10   180.0000    172.5057    3.8129           7.4943    6.350
 11   166.0000    159.9118    2.2406           6.0882    7.060
 12   138.0000    151.5420    3.6422         -13.5420    6.450

      Student                         Cook's
Obs   Residual    -2-1 0 1 2          D
  8    2.297      |    |****    |     0.229
  9    2.332      |    |****    |     0.386
 10    1.180      |    |**      |     0.126
 11    0.862      |    |*       |     0.019
 12   -2.100      | ****|       |     0.351

Note: “Student Residual” here is the “usual” (non-deleted) studentized residual.


SBP (output using /influence)

                                    Hat Diag   Cov
Obs   Residual    RStudent    H          Ratio      DFFITS
 11     6.0882      0.8583    0.0915     1.1431      0.2724
 12   -13.5420     -2.2462    0.2418     0.7687     -1.2685
 13    -6.0834     -0.8786    0.1332     1.1920     -0.3444
 14    -6.5753     -0.9190    0.0720     1.1019     -0.2560

              -------------------DFBETAS-------------------
Obs   Intercept    Size       Age        Smk
 11    -0.1586     0.0463     0.0559     0.1663
 12     0.0690    -1.0997     0.9370    -0.3349
 13     0.1961    -0.0432    -0.1082     0.1451
 14     0.0061    -0.0548     0.0183     0.1751

Sum of Squared Residuals          1536.14305
Predicted Residual SS (PRESS)     2170.72371

Note: “RStudent” is the studentized DELETED residual.


Collaborative Learning Activity

If we have time, we’ll continue with Activity #2 from the handout. If not, you may wish to look at it outside of class.


Questions?


Upcoming in Topic 6...

Model Selection (Chapters 9 & 16)