
Page 1

Stat 112: Lecture 16 Notes

• Finish Chapter 6:
  – Influential Points for Multiple Regression (Section 6.7)
  – Assessing the Independence Assumption and Remedies for Its Violation (Section 6.8)

• Homework 5 due next Thursday. I will e-mail it tonight.

• Please let me know of any ideas you want to discuss for the final project.

Page 2

Multiple regression, modeling and outliers, leverage and influential points

Pollution Example

• Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities from 1959 to 1961.

• The variables are:
  – y (MORT) = total age-adjusted mortality in deaths per 100,000 population;
  – PRECIP = mean annual precipitation (in inches);
  – EDUC = median number of school years completed for persons 25 and older;
  – NONWHITE = percentage of the 1960 population that is nonwhite;
  – NOX = relative pollution potential of NOx (related to the tons of NOx emitted per day per square kilometer);
  – SO2 = relative pollution potential of SO2.

Page 3

Multiple Regression: Steps in Analysis

1. Preliminaries: Define the question of interest. Review the design of the study. Correct errors in the data.

2. Explore the data. Use graphical tools, e.g., scatterplot matrix; consider transformations of explanatory variables; fit a tentative model; check for outliers and influential points.

3. Formulate an inferential model. Word the questions of interest in terms of model parameters.

Page 4

Multiple Regression: Steps in Analysis Continued

4. Check the Model. (a) Check the model assumptions of linearity, constant variance, normality. (b) If needed, return to step 2 and make changes to the model (such as transformations or adding terms for interaction and curvature); (c) Drop variables from the model that are not of central interest and are not significant.

5. Infer the answers to the questions of interest using appropriate inferential tools (e.g., confidence intervals, hypothesis tests, prediction intervals).

6. Presentation: Communicate the results to the intended audience.

Page 5

Scatterplot Matrix

• Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These plots can suggest needed transformations of the explanatory variables and reveal potential outliers and influential points.

• Scatterplot matrix in JMP: click Analyze, Multivariate Methods, Multivariate, and then put the response variable first in the Y, Columns box, followed by the explanatory variables.
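For readers working outside JMP, here is a minimal Python sketch producing an analogous scatterplot matrix; it assumes pollution.JMP has been exported to a CSV file named pollution.csv (a hypothetical name) with the same column names.

```python
# A minimal sketch, assuming pollution.JMP has been exported to a CSV file
# named pollution.csv (hypothetical name) with the same column names.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("pollution.csv")
cols = ["MORT", "PRECIP", "EDUC", "NONWHITE", "NOX", "SO2"]  # response first
pd.plotting.scatter_matrix(df[cols], figsize=(10, 10), diagonal="hist")
plt.show()
```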

Page 6

[Figure: scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, and SO2.]

Page 7

Crunched Variables

• When an X variable is “crunched” – meaning that most of its values are bunched together and a few are far apart – there will be influential points. To reduce the effects of crunching, it is a good idea to transform the variable to the log of the variable.
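As an illustration, here is a minimal sketch of this remedy, continuing from the earlier sketch with the (hypothetical) pollution data frame `df`:

```python
# A minimal sketch of the log-transformation remedy, continuing from the
# earlier sketch with the (hypothetical) pollution data frame `df`.
# np.log is the natural logarithm.
import numpy as np

df["Log(NOX)"] = np.log(df["NOX"])
df["Log(SO2)"] = np.log(df["SO2"])
# Correlations among the original and transformed variables:
print(df[["MORT", "PRECIP", "EDUC", "NONWHITE",
          "NOX", "SO2", "Log(NOX)", "Log(SO2)"]].corr().round(4))
```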

Page 8

2. a) From the scatterplot of MORT vs. NOX, we see that the NOX values are crunched very tightly together; a log transformation of NOX is needed.

b) The curvature in MORT vs. SO2 indicates that a log transformation of SO2 may be suitable.

After the two transformations we have the following correlations:

           MORT     PRECIP   EDUC     NONWHITE  NOX      SO2      Log(NOX)  Log(SO2)
MORT       1.0000   0.5095  -0.5110   0.6437   -0.0774   0.4259   0.2920    0.4031
PRECIP     0.5095   1.0000  -0.4904   0.4132   -0.4873  -0.1069  -0.3683   -0.1212
EDUC      -0.5110  -0.4904   1.0000  -0.2088    0.2244  -0.2343   0.0180   -0.2562
NONWHITE   0.6437   0.4132  -0.2088   1.0000    0.0184   0.1593   0.1897    0.0524
NOX       -0.0774  -0.4873   0.2244   0.0184    1.0000   0.4094   0.7054    0.3582
SO2        0.4259  -0.1069  -0.2343   0.1593    0.4094   1.0000   0.6905    0.7738
Log(NOX)   0.2920  -0.3683   0.0180   0.1897    0.7054   0.6905   1.0000    0.7328
Log(SO2)   0.4031  -0.1212  -0.2562   0.0524    0.3582   0.7738   0.7328    1.0000

Page 9

[Figure: scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, SO2, Log(NOX), and Log(SO2) after the transformations.]

Page 10

Influential Points, High Leverage Points, and Outliers in Multiple Regression

• As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook’s distances (use Save Columns to save Cook’s D Influence and Hats).
• High influence points: Cook’s distance > 1.
• High leverage points: hat value greater than (3*(# of explanatory variables + 1))/n. These are points whose explanatory variables form an outlier in a multidimensional sense.
• Use the same guidelines for dealing with influential observations as in simple linear regression.
• A point with an unusual Y given its explanatory variables is one whose residual is more than 3 RMSEs away from zero.
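A minimal Python sketch of these diagnostics, continuing from the earlier sketches with the (hypothetical) pollution data frame `df`; the thresholds mirror the rules above:

```python
# A minimal sketch of leverage, influence, and outlier diagnostics,
# continuing from the earlier sketches with the pollution data frame `df`.
import numpy as np
import statsmodels.formula.api as smf

model = smf.ols("MORT ~ PRECIP + EDUC + NONWHITE + np.log(NOX) + np.log(SO2)",
                data=df).fit()
infl = model.get_influence()

n = len(df)
k = 5  # number of explanatory variables
high_influence = infl.cooks_distance[0] > 1              # Cook's distance > 1
high_leverage = infl.hat_matrix_diag > 3 * (k + 1) / n   # hat > 3(k+1)/n
outlier = np.abs(model.resid) > 3 * np.sqrt(model.mse_resid)  # > 3 RMSEs

print(df.index[high_influence], df.index[high_leverage], df.index[outlier])
```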

Page 11

Response MORT

Summary of Fit
  RSquare                 0.688278
  RSquare Adj             0.659415
  Root Mean Square Error  36.30065

Parameter Estimates
  Term       Estimate    Std Error  t Ratio  Prob>|t|
  Intercept  940.6541    94.05424   10.00    <.0001
  PRECIP     1.9467286   0.700696    2.78    0.0075
  EDUC       -14.66406   6.937846   -2.11    0.0392
  NONWHITE   3.028953    0.668519    4.53    <.0001
  log NOX    6.7159712   7.39895     0.91    0.3681
  log SO2    11.35814    5.295487    2.14    0.0365

[Figure: Residual by Predicted plot of the MORT residuals against predicted MORT; New Orleans, LA is labeled.]

[Figure: plot of Cook’s distances; New Orleans, LA stands out above 1.]

New Orleans has a Cook’s distance greater than 1, so New Orleans may be influential.

3 RMSEs ≈ 108; no points have residuals that large, so no points are outliers in the residuals.

Page 12

Labeling Observations

• To have points identified by a certain column, go to the column, click Columns, and click Label (click Unlabel to unlabel).

• To label a row, go to the row, click Rows, and click Label.

Page 13

Multiple Regression with New Orleans

Summary of Fit
  RSquare                     0.688278
  RSquare Adj                 0.659415
  Root Mean Square Error      36.30065
  Mean of Response            940.3568
  Observations (or Sum Wgts)  60

Analysis of Variance
  Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
  Model      5  157115.28       31423.1      23.8462  <.0001
  Error     54  71157.80        1317.7
  C. Total  59  228273.08

Parameter Estimates
  Term       Estimate    Std Error  t Ratio  Prob>|t|
  Intercept  940.6541    94.05424   10.00    <.0001
  PRECIP     1.9467286   0.700696    2.78    0.0075
  EDUC       -14.66406   6.937846   -2.11    0.0392
  NONWHITE   3.028953    0.668519    4.53    <.0001
  Log NOX    6.7159712   7.39895     0.91    0.3681
  Log SO2    11.35814    5.295487    2.14    0.0365

Multiple Regression without New Orleans

Summary of Fit
  RSquare                     0.724661
  RSquare Adj                 0.698686
  Root Mean Square Error      32.06752
  Mean of Response            937.4297
  Observations (or Sum Wgts)  59

Analysis of Variance
  Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
  Model      5  143441.28       28688.3      27.8980  <.0001
  Error     53  54501.26        1028.3
  C. Total  58  197942.54

Parameter Estimates
  Term       Estimate    Std Error  t Ratio  Prob>|t|
  Intercept  852.3761    85.9328     9.92    <.0001
  PRECIP     1.3633298   0.635732    2.14    0.0366
  EDUC       -5.666948   6.52378    -0.87    0.3889
  NONWHITE   3.0396794   0.590566    5.15    <.0001
  Log NOX    -9.898442   7.730645   -1.28    0.2060
  Log SO2    26.032584   5.931083    4.39    <.0001

Removing New Orleans has a large impact on the coefficients of log NOX and log SO2; in particular, it reverses the sign of the coefficient on log NOX.
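A minimal sketch of this comparison in Python, continuing from the earlier sketches; the CITY column used to drop New Orleans is a hypothetical name, not necessarily the one in pollution.JMP:

```python
# A minimal sketch of refitting without the suspect point; the CITY column
# naming "New Orleans, LA" is a hypothetical assumption about the data file.
import numpy as np
import statsmodels.formula.api as smf

formula = "MORT ~ PRECIP + EDUC + NONWHITE + np.log(NOX) + np.log(SO2)"
fit_all = smf.ols(formula, data=df).fit()

df_wo = df[df["CITY"] != "New Orleans, LA"]   # drop New Orleans
fit_wo = smf.ols(formula, data=df_wo).fit()

# Compare coefficients; note the sign change on log NOX.
print(fit_all.params.round(2))
print(fit_wo.params.round(2))
```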

Page 14

Dealing with New Orleans

• New Orleans is influential.
• New Orleans also has high leverage: hat = 0.45 > (3*6)/60 = 0.3.
• Thus, it is reasonable to exclude New Orleans from the analysis, report that we excluded it, and note that our model does not apply to cities whose explanatory variables are in the range of New Orleans’s.

Page 15

Leverage Plots

• A “simple regression view” of a multiple regression coefficient. For xj: plot the residual of y (from a regression on all the x’s except xj) against the residual of xj (from a regression on the rest of the x’s); both axes are recentered.
• The slope in the leverage plot is the coefficient for that variable in the multiple regression.
• The distances from the points to the least squares line are the multiple regression residuals. The distance from a point to the horizontal line is its residual if the explanatory variable were not included in the model.
• Leverage plots are useful for identifying outliers, high leverage points, and influential points for xj. (Use them the same way as in simple regression to identify the effect of points on the regression coefficient of a particular variable.)
• Leverage plots are particularly useful for identifying points that are influential for a particular coefficient in the regression; a sketch of constructing them follows below.
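statsmodels calls these partial regression (added-variable) plots; here is a minimal sketch, assuming the fitted model `fit_all` from the earlier sketch:

```python
# A minimal sketch of leverage plots, which statsmodels calls partial
# regression (added-variable) plots; `fit_all` is from the earlier sketch.
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig = plt.figure(figsize=(10, 8))
sm.graphics.plot_partregress_grid(fit_all, fig=fig)
plt.show()
```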

Page 16

[Figure: leverage plots of MORT leverage residuals for each explanatory variable: PRECIP (P=0.0075), EDUC (P=0.0392), NONWHITE (P<.0001), Log NOX (P=0.3681), and Log SO2 (P=0.0365).]

The enlarged observation, New Orleans, is an outlier for estimating each coefficient and is highly leveraged for estimating the coefficients of interest on log NOX and log SO2. Since New Orleans is both highly leveraged and an outlier, we expect it to be influential.

Page 17

Response MORT

Summary of Fit
  RSquare                     0.724661
  RSquare Adj                 0.698686
  Root Mean Square Error      32.06752
  Mean of Response            937.4297
  Observations (or Sum Wgts)  59

Parameter Estimates
  Term       Estimate    Std Error  t Ratio  Prob>|t|
  Intercept  852.3761    85.9328     9.92    <.0001
  PRECIP     1.3633298   0.635732    2.14    0.0366
  EDUC       -5.666948   6.52378    -0.87    0.3889
  NONWHITE   3.0396794   0.590566    5.15    <.0001
  log NOX    -9.898442   7.730645   -1.28    0.2060
  log SO2    26.032584   5.931083    4.39    <.0001

[Figure: Residual by Predicted plot of the MORT residuals against predicted MORT for the fit without New Orleans.]

Analysis without New Orleans

Page 18

Checking the Model

[Figure: normal quantile plot of the MORT residuals.]

[Figure: plots of the MORT residuals against PRECIP and against EDUC.]

Page 19

[Figure: plots of the MORT residuals against NONWHITE, log NOX, and log SO2.]

The linearity, constant variance, and normality assumptions all appear reasonable.
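A minimal Python sketch of these residual checks, continuing from the earlier sketches with `df_wo` and `fit_wo` (the fit without New Orleans):

```python
# A minimal sketch of the residual checks above, continuing from the earlier
# sketches with `df_wo` and `fit_wo` (the fit without New Orleans).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

sm.qqplot(fit_wo.resid, line="q")  # normal quantile plot of residuals

checks = {"PRECIP": df_wo["PRECIP"], "EDUC": df_wo["EDUC"],
          "NONWHITE": df_wo["NONWHITE"],
          "log NOX": np.log(df_wo["NOX"]), "log SO2": np.log(df_wo["SO2"])}
for name, x in checks.items():
    plt.figure()
    plt.scatter(x, fit_wo.resid)  # residuals vs. each explanatory variable
    plt.axhline(0)
    plt.xlabel(name)
    plt.ylabel("Residual MORT")
plt.show()
```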

Page 20

Inference About Questions of Interest

• Strong evidence that mortality is positively associated with SO2 for fixed levels of precipitation, education, nonwhite, and NOX.

• No strong evidence that mortality is associated with NOX for fixed levels of precipitation, education, nonwhite, and SO2.

Parameter Estimates
  Term       Estimate    Std Error  t Ratio  Prob>|t|  Lower 95%   Upper 95%
  Intercept  780.82216   24.42239   31.97    <.0001    731.85821   829.78611
  PRECIP     1.520283    0.608129    2.50    0.0155    0.3010583   2.7395076
  NONWHITE   3.0510356   0.589079    5.18    <.0001    1.8700043   4.2320669
  log NOX    -11.72078   7.423624   -1.58    0.1202    -26.60425   3.1626888
  log SO2    28.343404   5.28898     5.36    <.0001    17.739639   38.94717
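A minimal sketch of obtaining such estimates and 95% confidence intervals in Python, continuing from the earlier sketches with `df_wo` (without New Orleans) and dropping EDUC as in the output above:

```python
# A minimal sketch of the inferential output above, continuing from the
# earlier sketches; EDUC is dropped, as in the slide's final model.
import numpy as np
import statsmodels.formula.api as smf

fit_final = smf.ols("MORT ~ PRECIP + NONWHITE + np.log(NOX) + np.log(SO2)",
                    data=df_wo).fit()
print(fit_final.summary())             # estimates, t ratios, p-values
print(fit_final.conf_int(alpha=0.05))  # 95% confidence intervals
```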

Page 21

Multiple Regression and Causal Inference

• Goal: Figure out what the causal effect of decreasing air pollution on mortality would be (keeping everything else in the world fixed).

• Lurking variable: A variable that is associated with both air pollution in a city and mortality in a city.

• In order to figure out whether air pollution causes mortality, we want to compare mean mortality among cities with different air pollution levels but the same values of the confounding variables.

• If we include all of the lurking variables in the multiple regression model, the coefficient on air pollution represents the change in the mean of mortality that is caused by a one unit increase in air pollution.

• If we omit some of the lurking variables, then there is omitted variables bias, i.e., the multiple regression coefficient on air pollution does not measure the causal effect of air pollution.
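A purely illustrative simulation sketch of omitted variables bias, with invented numbers rather than the lecture’s data: a lurking variable raises both pollution and mortality, and omitting it biases the pollution coefficient.

```python
# A purely illustrative simulation of omitted variables bias (invented
# numbers, not the lecture's data). The lurking variable L raises both
# pollution and mortality; omitting L biases the pollution coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
L = rng.normal(size=n)                          # lurking variable
pollution = 0.8 * L + rng.normal(size=n)
mortality = 1.0 * pollution + 2.0 * L + rng.normal(size=n)

full = sm.OLS(mortality, sm.add_constant(np.column_stack([pollution, L]))).fit()
omit = sm.OLS(mortality, sm.add_constant(pollution)).fit()
print(full.params[1])  # close to the causal effect, 1.0
print(omit.params[1])  # biased upward by the omitted lurking variable
```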

Page 22

Time Series Data and Autocorrelation

• When Y is a variable collected for the same entity (person, state, country) over time, we call the data time series data.

• For time series data, we need to consider the independence assumption for the simple and multiple regression model.

• Independence Assumption: The residuals are independent of one another. This means that if the residual is positive this year, next year’s residual must be equally likely to be positive or negative, i.e., there is no autocorrelation.

• Positive autocorrelation: Positive residuals are more likely to be followed by positive residuals than by negative residuals.

• Negative autocorrelation: Positive residuals are more likely to be followed by negative residuals than by positive residuals.
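One informal check is the lag-1 correlation of the residuals; here is a minimal sketch, assuming some fitted time series regression `fit` (such as the melanoma model fit in the sketch after the Durbin-Watson slide):

```python
# A minimal sketch of an informal lag-1 autocorrelation check on the
# residuals of a fitted time series regression `fit` (assumed to exist).
import numpy as np

resid = np.asarray(fit.resid)
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"lag-1 autocorrelation of residuals: {lag1:.4f}")  # > 0: positive
```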

Page 23

Example: Melanoma Incidence

• Is the incidence of melanoma (skin cancer) increasing over time? Is melanoma related to solar radiation?

• We address these questions by looking at melanoma incidence among males from the Connecticut Tumor Registry from 1936 to 1972. The data are in melanoma.JMP.

Page 24

Response MELANOMA

Parameter Estimates
  Term       Estimate    Std Error  t Ratio  Prob>|t|
  Intercept  -225.1075   13.25673   -16.98   <.0001
  SUNSPOT    0.0014998   0.001397     1.07   0.2904
  YEAR       0.1165497   0.006787    17.17   <.0001

[Figure: plot of the MELANOMA residuals against YEAR, 1935–1975.]

The residuals suggest positive autocorrelation.

Page 25

Test of Independence

• The Durbin-Watson test is a test of whether the residuals are independent. The null hypothesis is that the residuals are independent, and the alternative hypothesis is that the residuals are not independent (either positively or negatively autocorrelated).

• To compute the Durbin-Watson test in JMP, after Fit Model, click the red triangle next to Response, click Row Diagnostics, and click Durbin-Watson Test. Then click the red triangle next to Durbin-Watson to get the p-value.

• For the melanoma data, the Durbin-Watson output is shown below.
• Remedy for autocorrelation: add the lagged value of Y to the model.

Durbin-Watson
  Durbin-Watson  Number of Obs.  AutoCorrelation  Prob<DW
  1.1859274      37              0.3774           0.0018
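A minimal Python sketch of the test and the lagged-Y remedy, assuming melanoma.JMP has been exported as melanoma.csv (a hypothetical name) with columns MELANOMA, SUNSPOT, and YEAR; note that statsmodels’ durbin_watson returns only the statistic, not JMP’s p-value.

```python
# A minimal sketch of the Durbin-Watson test and the lagged-Y remedy,
# assuming a hypothetical melanoma.csv export of melanoma.JMP with columns
# MELANOMA, SUNSPOT, YEAR. durbin_watson returns only the statistic.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

mel = pd.read_csv("melanoma.csv").sort_values("YEAR")
fit = smf.ols("MELANOMA ~ SUNSPOT + YEAR", data=mel).fit()
print(durbin_watson(fit.resid))  # about 1.19 for these data

# Remedy: add the lagged value of Y as an explanatory variable.
mel["LAG_MELANOMA"] = mel["MELANOMA"].shift(1)
fit_lag = smf.ols("MELANOMA ~ SUNSPOT + YEAR + LAG_MELANOMA",
                  data=mel.dropna()).fit()
print(durbin_watson(fit_lag.resid))
```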