KNN Ch. 3: Diagnostics and Remedial Measures
Applied Regression Analysis, BUSI 6220
Diagnostics for the Predictor Variable
Dot plots, sequence plots, stem-and-leaf plots
These plots mainly check for outlying observations, which is useful in later diagnosis.
Residual Analysis: Why Look at the Residuals?
To detect non-linearity of the regression function, heteroscedasticity (lack of constant variance), auto-correlation, outliers, non-normality, and important predictor variables left out.
Regression model assumptions: errors are independent (have zero covariance), have constant variance, and are normally distributed.
Diagnostics for Residuals
To detect non-linearity of the regression function, heteroscedasticity, auto-correlation, outliers, non-normality, and omitted predictor variables, examine the following plots of residuals:
1. Against the predictor (if X1 is the only predictor)
2. Absolute or squared residuals against the predictor
3. Against fitted values (when there are many Xi)
4. Against time
5. Against omitted predictor variables
6. Box plot
7. Normal probability plot
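As a minimal sketch of plots 1-3 (the data here are hypothetical), the residuals and fitted values from a simple OLS fit can be computed as follows; the standard diagnostic plots are then residuals vs. X, residuals vs. fitted values, and absolute residuals vs. X:

```python
import numpy as np

# Toy data (hypothetical) to illustrate the standard residual plots
rng = np.random.default_rng(7)
x = np.linspace(0, 10, 50)
y = 3 + 2 * x + rng.normal(0, 1, 50)

# Simple OLS fit
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
resid = y - fitted

# With an intercept, OLS residuals sum to (essentially) zero and are
# uncorrelated with X; plot resid vs. x, resid vs. fitted, and
# np.abs(resid) vs. x for diagnostics 1-3.
```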
Diagnostics for Residuals
Approximate expected value of the kth smallest residual:

E[e(k)] ≈ √MSE · z[(k − 0.375) / (n + 0.25)]

where z(A) is the A-th percentile of the standard normal distribution. Plotting the ordered residuals against these expected values gives the normal probability plot.
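The formula above can be sketched directly (the residuals here are simulated stand-ins, assuming a model with 2 estimated parameters):

```python
import numpy as np
from scipy import stats

def expected_ordered_residuals(e, mse):
    """Approximate E[e_(k)] = sqrt(MSE) * z[(k - 0.375) / (n + 0.25)]."""
    n = len(e)
    k = np.arange(1, n + 1)
    return np.sqrt(mse) * stats.norm.ppf((k - 0.375) / (n + 0.25))

# Hypothetical residuals from a model with p = 2 parameters
rng = np.random.default_rng(0)
e = rng.normal(0, 2, size=20)
mse = np.sum(e ** 2) / (len(e) - 2)
exp_vals = expected_ordered_residuals(e, mse)
# Normal probability plot: np.sort(e) against exp_vals; a roughly
# straight line supports the normality assumption.
```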
Tests involving Residuals
The Correlation Test for Normality
H0: The residuals are normal
HA: The residuals are not normal
Compute the correlation between the ordered residuals ei and their expected values under normality. Use Table B.6: the observed coefficient of correlation should be at least as large as the table value for a given level of significance to retain H0.
Tests involving Residuals
Other Tests for Normality
H0: The residuals are normal
HA: The residuals are not normal
Anderson-Darling (very powerful; may be used for small sets, n < 25)
Ryan-Joiner
Shapiro-Wilk
Kolmogorov-Smirnov
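Two of these tests are available in scipy; a minimal sketch on simulated residuals (Ryan-Joiner, essentially the correlation test above, is provided by Minitab rather than scipy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
e = rng.normal(size=40)                  # stand-in for model residuals

w, p_sw = stats.shapiro(e)               # Shapiro-Wilk
ad = stats.anderson(e, dist='norm')      # Anderson-Darling

# Shapiro-Wilk: a small p-value means reject normality.
# Anderson-Darling: compare ad.statistic against ad.critical_values
# at the chosen significance level (15%, 10%, 5%, 2.5%, 1%).
```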
Tests involving Residuals (Constancy of Error Variance)
The Modified Levene Test
Partitions the observations by the independent variable into two groups (high X values and low X values), then tests the null
H0: The groups have equal error variances
Similar to a pooled-variance t-test for a difference in two means of independent samples. It is robust to departures from normality of the error terms. A large sample size is essential so that dependencies of the error terms on each other can be neglected. It uses the group "median" instead of the "mean" (why?).
Tests involving Residuals (Constancy of Error Variance)
The Modified Levene Test
Let d_i1 = |e_i1 − ẽ1| and d_i2 = |e_i2 − ẽ2|, where ẽ1 and ẽ2 are the medians of the residuals in the two groups. Now the d_i1 and d_i2 are the data points, i.e., the t-test is based on these two sets of data points:

t* = (d̄1 − d̄2) / (s · √(1/n1 + 1/n2))

where s² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
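The statistic above can be sketched as follows (the residual groups are simulated, with the high-X group given a visibly larger spread):

```python
import numpy as np
from scipy import stats

def modified_levene_t(e1, e2):
    """Two-sample t-test on absolute deviations from the group medians."""
    d1 = np.abs(e1 - np.median(e1))
    d2 = np.abs(e2 - np.median(e2))
    n1, n2 = len(d1), len(d2)
    # Pooled variance of the absolute deviations
    s2 = ((n1 - 1) * d1.var(ddof=1) + (n2 - 1) * d2.var(ddof=1)) / (n1 + n2 - 2)
    t = (d1.mean() - d2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
    p = 2 * stats.t.sf(abs(t), n1 + n2 - 2)
    return t, p

# Hypothetical residual groups: high-X residuals have larger spread
rng = np.random.default_rng(3)
e_low = rng.normal(0, 1, 25)
e_high = rng.normal(0, 3, 25)
t_star, p = modified_levene_t(e_low, e_high)
# Small p => reject H0 of equal error variances.
```

For comparison, `scipy.stats.levene(e_low, e_high, center='median')` implements the closely related Brown-Forsythe F form of the same idea.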
Read "Comments" on page 118 and go through the Breusch-Pagan test on page 119.
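As a preview of the Breusch-Pagan test, here is a sketch of the version in the text (regress e² on X, then compare (SSR*/2)/(SSE/n)² to a chi-square with 1 df); the residuals below are simulated directly rather than taken from a fitted model:

```python
import numpy as np
from scipy import stats

def breusch_pagan(x, e):
    """Breusch-Pagan statistic: regress e^2 on X,
    then X2_BP = (SSR*/2) / (SSE/n)^2, compared to chi-square(1)."""
    n = len(e)
    e2 = e ** 2
    # Simple OLS of e^2 on x
    b1 = np.cov(x, e2, ddof=1)[0, 1] / np.var(x, ddof=1)
    fitted = e2.mean() + b1 * (x - x.mean())
    ssr_star = np.sum((fitted - e2.mean()) ** 2)   # regression SS of e^2 on x
    sse = np.sum(e2)                               # SSE of the original model
    x2 = (ssr_star / 2) / (sse / n) ** 2
    return x2, stats.chi2.sf(x2, df=1)

# Hypothetical heteroscedastic residuals: spread grows with x
rng = np.random.default_rng(4)
x = np.linspace(1, 10, 100)
e = x * rng.normal(0, 1, 100)
x2, p = breusch_pagan(x, e)
# Small p => reject constancy of the error variance.
```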
F Test for Lack of Fit
A comparison of the "full model" sum of squares error and the "lack of fit" sum of squares.
For best results, requires repeat observations at, at least, one X level.
Full model: Y_ij = μ_j + ε_ij (μ_j = mean response when X = X_j)
Reduced model: Y_ij = β0 + β1·X_j + ε_ij
(Why "reduced"?)
F Test for Lack of Fit
SSE(Full) = SSPE = Σ_j Σ_i (Y_ij − Ȳ_j)²
(Labeled "pure error" since it is an unbiased estimator of the true error variance. See 3.31 and 3.32, page 123.)
SSLF = SSE(Reduced) − SSPE, where SSE(Reduced) = SSE from the ordinary least squares regression model.
Test statistic:

F* = [SSLF / (c − p)] / [SSPE / (n − c)]

where c is the number of distinct X levels and n the total number of observations. (What is "p"?)
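The decomposition and F* can be sketched on a small hypothetical data set with repeat observations at each X level:

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical) with repeats at each X level
x = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
y = np.array([2.1, 1.9, 4.2, 3.8, 5.9, 6.1, 9.2, 8.8])
n, p = len(y), 2                          # p = parameters in the linear model
levels = np.unique(x)
c = len(levels)                           # number of distinct X levels

# Reduced model: ordinary least squares line
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)    # SSE(Reduced)

# Full model: level means give the pure-error sum of squares
sspe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
sslf = sse - sspe                         # lack-of-fit sum of squares

f_star = (sslf / (c - p)) / (sspe / (n - c))
p_val = stats.f.sf(f_star, c - p, n - c)  # compare to F(c-p, n-c)
```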
Be sure to compare the ANOVA table on page 126 with the OLS ANOVA table.
Overview of Some Remedial Measures
The problem: simple linear regression is not appropriate.
The solution:
1. Abandon the model ("Eagle to Hawk; abort mission and return to base.")
2. Remedy the situation:
If non-independent error terms, then work with a model that calls for correlated error terms (Ch. 12).
If heteroscedasticity, then use the WLS method to estimate parameters (Ch. 10) or use transformations of the data.
If the scatter plot indicates non-linearity, then either use a non-linear regression function (Ch. 7) or transform to linear.
NEXT: We will look at one such powerful transformation method.
The Box-Cox Transformation Method
The family of power transforms on Y is given as: Y' = Y^λ
The family easily includes simple transforms such as the square root, the square, etc.
By definition, when λ = 0, then Y' = log_e(Y).
When the response variable is so transformed, the normal error regression model becomes: Y_i^λ = β0 + β1·X_i + ε_i
We would like to determine the "best" value of λ.
Method 1: Maximum likelihood estimation
Maximize over λ, β0, β1, σ²:

L = [1 / (2πσ²)^(n/2)] · exp{ −(1/2σ²) · Σ_{i=1..n} (Y_i^λ − β0 − β1·X_i)² }
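As a quick illustration, `scipy.stats.boxcox` returns the maximum-likelihood λ for the no-predictor case (the regression version above is what Method 2 handles via the SSE search); the data here are simulated and hypothetical:

```python
import numpy as np
from scipy import stats

# Right-skewed, positive data: lognormal, so log(Y) is exactly normal
rng = np.random.default_rng(5)
y = rng.lognormal(mean=1.0, sigma=0.5, size=200)

y_t, lam = stats.boxcox(y)   # MLE of lambda (no predictors)
# For lognormal data the estimated lambda should be near 0,
# pointing to the log transform.
```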
The Box-Cox Transformation Method
Method 2: Numerical search
Step 1: Set a value of λ.
Step 2: Standardize the Y_i observations:
If λ ≠ 0 then W_i = K1·(Y_i^λ − 1)
If λ = 0 then W_i = K2·log_e(Y_i)
where K2 = (Π_{i=1..n} Y_i)^(1/n) (the geometric mean of the Y_i) and K1 = 1 / (λ·K2^(λ−1)).
Step 3: Now regress the set W on the set X.
Step 4: Note the corresponding SSE.
Step 5: Change λ and repeat steps 2 to 4 until the lowest SSE is obtained.
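The five steps above can be sketched as a grid search (the data are hypothetical, built so that log(Y) is linear in X and the best λ should land near 0):

```python
import numpy as np

def boxcox_sse(x, y, lam):
    """SSE from regressing the standardized W on X for a given lambda
    (steps 2-4 of the numerical search)."""
    k2 = np.exp(np.mean(np.log(y)))          # geometric mean of the Y_i
    if abs(lam) > 1e-8:
        k1 = 1.0 / (lam * k2 ** (lam - 1))
        w = k1 * (y ** lam - 1)
    else:                                     # lambda = 0 case
        w = k2 * np.log(y)
    b1 = np.cov(x, w, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = w.mean() - b1 * x.mean()
    return np.sum((w - (b0 + b1 * x)) ** 2)

# Hypothetical data where log(Y) is linear in X
rng = np.random.default_rng(6)
x = np.linspace(1, 10, 100)
y = np.exp(0.3 * x + rng.normal(0, 0.2, 100))

grid = np.linspace(-1.0, 1.0, 21)             # steps 1 and 5: lambda grid
sse = [boxcox_sse(x, y, lam) for lam in grid]
best = float(grid[int(np.argmin(sse))])       # lambda with the lowest SSE
```

Standardizing to W keeps the SSE values comparable across different λ, which is the whole point of the K1 and K2 scaling.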
Let's try both methods with the GMAT data. What should we get as the best λ?