1
5. Multiple Regression II
ECON 251
Research Methods
2
The Regression Modeling Process

Multiple regression also introduces complexity in the modeling process itself.
• How does one decide which variables should be considered?
• How many variables should be initially considered?
• How many variables should there be in your final model?
• What does one do if the data cannot be found on the exact variable of interest?

While the answers to many of these questions involve as much art as science, there are some steps that can guide you along your way.
In this section, we will also introduce a few more tools to help you with some of these decisions.
3
The Modeling Process—Step by Step

1. Develop a model that has a sound basis.
   ― Theoretical and practical inputs into model formation
   ― Working group of experts for a brainstorming session
   ― Literature review on factors influencing the variable of interest
2. Gather data for the variables in the model.
   ― Gather data for the dependent and independent variables.
   ― If data cannot be found for the exact variable, use a "proxy."
   ― Example: You believe sales of your product follow GDP growth, but you want a model of monthly data, and GDP figures are quarterly. What do you do?
3. Draw the scatter diagram to determine whether a linear model (or another form) appears to be appropriate.
4. Estimate the model coefficients and statistics using statistical computer software.
4
The Modeling Process—Step by Step

5. Assess the model fit and usefulness using the model statistics.
   • Use the three-step process we developed with simple linear regression.
   • Do the variables make sense? (significance, signs)
6. Diagnose violations of required conditions. Try to remedy problems when identified.
7. Assess the model fit and usefulness using the model statistics.
   • Notice the iterative nature of the process.
8. If the model passes the assessment tests, use it to:
   • Predict the value of the dependent variable
   • Provide interval estimates for these predictions
   • Provide insight into the impact of each independent variable on the dependent variable.

Remember: Statistics informs judgment; it does not replace it. Use your common sense when developing, finalizing, and employing a model!
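Step 4 (estimating the coefficients) is normally done with statistical software, but the underlying least-squares computation is short. A minimal sketch in Python using NumPy on made-up data (the variables and true coefficients below are illustrative, not from any example in this course):

```python
import numpy as np

# Illustrative data (made up): 20 observations, 2 independent variables.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 20)
x2 = rng.uniform(0, 5, 20)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 0.5, 20)

# Design matrix: a column of ones for the intercept, then the x's.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares coefficient estimates, b = (X'X)^(-1) X'y.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # roughly [3.0, 1.5, -2.0]
```

Any statistics package (Excel's Data Analysis ToolPak, R, Stata) performs this same computation and adds the standard errors and test statistics used in Steps 5-7.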
5
Example – Motel Profitability
La Quinta Motor Inns is planning an expansion. Management wishes to predict which sites are likely to be profitable.

Step #1: Develop a model with a sound basis
• Several predictors of profitability can be identified, including:
  ― Competition
  ― Market awareness
  ― Demand generators
  ― Demographics
  ― Physical quality
6
Profitability is modeled as a function of the five groups of factors:

• Competition ― Rooms: number of hotel/motel rooms within 3 miles.
• Market awareness ― Nearest: distance in miles to the nearest competitor of the La Quinta inn.
• Demand generators ― Office space: in '000s of sq ft within 3 miles; College enrollment: in '000s of students within 3 miles.
• Demographics ― Income: median household income in '000s.
• Physical ― Disttown: distance to downtown in miles.

At this stage, you should also assign your "a priori" expectations of the sign of each coefficient for each independent variable. We'll use this information when we "assess" the model.
Example – Motel Profitability
7
Step #2: Gather Data
• Data were collected from 100 randomly selected inns that belong to La Quinta, and the following suggested model was run:

Margin = β0 + β1 Rooms + β2 Nearest + β3 Office + β4 College + β5 Income + β6 Disttwn + ε

First rows of the data:

INN   MARGIN   ROOMS   NEAREST   OFFICE   COLLEGE   INCOME   DISTTWN
1     55.5     3203    0.1       549      8         37       12.1
2     33.8     2810    1.5       496      17.5      39       0.4
3     49       2890    1.9       254      20        39       12.2
4     31.9     3422    1         434      15.5      36       2.7
5     57.4     2687    3.4       678      15.5      32       7.9
6     49       3759    1.4       635      19        41       4
Example – Motel Profitability
8
Step #3: Draw Scatter Diagrams

[Scatter diagram] Rooms (vertical axis) vs. Margin (horizontal): trend line y = -27.179x + 4228.4, R² = 0.2212

[Scatter diagram] Nearest (vertical axis) vs. Margin (horizontal): trend line y = -0.0183x + 2.8274, R² = 0.0257
9
Regression Statistics
Multiple R       0.724611
R Square         0.525062
Adjusted R Sq    0.494420
Standard Error   5.512084
Observations     100

ANOVA
            df    SS         MS         F          Significance F
Regression   6    3123.832   520.6387   17.13581   3.03E-13
Residual    93    2825.626   30.38307
Total       99    5949.458

            Coefficients   Std. Error   t Stat      P-value
Intercept    72.45461      7.893104      9.179483   1.11E-14
ROOMS        -0.00762      0.001255     -6.06871    2.77E-08
NEAREST      -1.64624      0.632837     -2.60136    0.010803
OFFICE        0.019766     0.003410      5.795594   9.24E-08
COLLEGE       0.211783     0.133428      1.587246   0.115851
INCOME       -0.41312      0.139552     -2.96034    0.003899
DISTTWN       0.225258     0.178709      1.260475   0.210651
Step #4: Estimate the Model
This is the sample regression equation (sometimes called the prediction equation):

MARGIN = 72.455 - 0.008 ROOMS - 1.646 NEAREST + 0.02 OFFICE + 0.212 COLLEGE - 0.413 INCOME + 0.225 DISTTWN
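As a quick check of the prediction equation, the fitted margin for inn 1 in the data table can be computed directly from the printout coefficients. A sketch (small differences from software output come from coefficient rounding):

```python
# Coefficients from the full-model printout.
b = {"intercept": 72.45461, "rooms": -0.00762, "nearest": -1.64624,
     "office": 0.019766, "college": 0.211783, "income": -0.41312,
     "disttwn": 0.225258}

# Inn 1 from the data table: MARGIN 55.5, ROOMS 3203, NEAREST 0.1,
# OFFICE 549, COLLEGE 8, INCOME 37, DISTTWN 12.1.
fitted = (b["intercept"] + b["rooms"] * 3203 + b["nearest"] * 0.1
          + b["office"] * 549 + b["college"] * 8
          + b["income"] * 37 + b["disttwn"] * 12.1)
residual = 55.5 - fitted
print(round(fitted, 2), round(residual, 2))  # fitted ≈ 47.87, residual ≈ 7.63
```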
Example – Motel Profitability
10
Step #5: Assess the Model
• We will add a number of steps and sub-steps to our model assessment process when using multiple regression. The assessment process becomes:

1a. R² (coefficient of determination)
1b. Adjusted R²
2. F-test for overall validity of the model
3. t-test for each slope
   ― using b_i (the estimate of the slope)
   ― Partial F-test to verify elimination of some independent variables
Example – Motel Profitability
11
1a. Coefficient of determination
• The definition is

    R² = 1 - SSE/SST

• From the printout, R² = 0.5251.
• 52.51% of the variation in the measure of profitability is explained by the linear regression model formulated above.
• Notice that we are not using SSR/SST. That version of the formula would still work for now, but it will not work once we introduce "Adjusted R²" . . .
Assessing the Model (step #5)
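With the SS column of the ANOVA printout in hand, R² follows directly from this definition. A small sketch using the Residual and Total SS from the La Quinta printout:

```python
sse = 2825.626   # Residual SS from the ANOVA table
sst = 5949.458   # Total SS from the ANOVA table
r_sq = 1 - sse / sst
print(round(r_sq, 4))  # 0.5251, matching the printout's R Square
```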
12
1b. The "Adjusted" Coefficient of Determination is defined as:

    Adjusted R² = 1 - [SSE/(n - k - 1)] / [SST/(n - 1)]

• As you add additional independent variables to your model, what happens to SST, SSR, and SSE? What happens to R²? To Adjusted R²?
• If all you cared about was a model with a high R², you might be tempted to increase the number of independent variables almost irrespective of the amount of significant explanatory power each added. Adjusted R² penalizes you a small amount for each additional independent variable you add. A new variable must contribute significantly to explaining SST before Adjusted R² will go up.
• From the printout, Adjusted R² = 0.4944, or 49.44%.
• 49.44% of the variation in the measure of profitability is explained by the linear regression model formulated above, after "adjusting for the degrees of freedom," or the "number of independent variables."
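The adjustment is just a degrees-of-freedom correction to the two sums of squares. A sketch reproducing the printout's value:

```python
sse, sst = 2825.626, 5949.458
n, k = 100, 6  # 100 observations, 6 independent variables
adj_r_sq = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(round(adj_r_sq, 4))  # 0.4944, matching the printout's Adjusted R Square
```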
13
2. The F-Test for Overall Validity of the Model
• Recall that in conducting this test, we are posing the question: Is there at least one independent variable linearly related to the dependent variable?
• To answer the question, we test the hypothesis:

    H0: β1 = β2 = … = βk = 0
    H1: At least one βi is not equal to zero.

• If at least one βi is not equal to zero, the model is valid.
• The F test:
  ― Construct the F statistic:  F = MSR / MSE
  ― Reject H0 if F > Fα, k, n-k-1
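The F statistic is just the ratio of the two mean squares in the ANOVA table. A sketch with the La Quinta printout values (the critical value 2.197 is read from an F table for α = 0.05 with 6 and 93 degrees of freedom):

```python
msr = 520.6387  # MS Regression from the ANOVA table
mse = 30.38307  # MS Residual from the ANOVA table
f_stat = msr / mse
f_crit = 2.197  # F(0.05, 6, 93) from a table
print(round(f_stat, 2), f_stat > f_crit)  # 17.14 True -> reject H0
```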
14
Excel provides the following ANOVA results:

ANOVA
            df    SS         MS         F          Significance F
Regression   6    3123.832   520.6387   17.13581   3.03382E-13
Residual    93    2825.626   30.38307
Total       99    5949.458

Fα, k, n-k-1 = F0.05, 6, 100-6-1 = 2.197
F = 17.14 > 2.197

Also, the p-value (Significance F) = 3.03382(10)^-13. Clearly, α = 0.05 > 3.03382(10)^-13, and the null hypothesis is rejected.

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the βi is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.
15
3a. Testing the coefficients
• The hypothesis for each βi:

    H0: βi = 0
    H1: βi ≠ 0

• Test statistic:

    t = (b_i - βi) / s_bi,  d.f. = n - k - 1

• Example—Motel Profitability:

            Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept    72.45461      7.893104          9.179483   1.11E-14    56.78049    88.12874
ROOMS        -0.00762      0.001255         -6.06871    2.77E-08    -0.01011    -0.00513
NEAREST      -1.64624      0.632837         -2.60136    0.010803    -2.90292    -0.38955
OFFICE        0.019766     0.003410          5.795594   9.24E-08     0.012993    0.026538
COLLEGE       0.211783     0.133428          1.587246   0.115851    -0.053178    0.476744
INCOME       -0.41312      0.139552         -2.96034    0.003899    -0.690245   -0.136000
DISTTWN       0.225258     0.178709          1.260475   0.210651    -0.129622    0.580138
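With βi = 0 under H0, each t statistic in the printout is just the coefficient divided by its standard error. A sketch recomputing two of them from the rounded printout values (tiny differences from the printed t stats are due to rounding):

```python
coef = {"ROOMS": (-0.00762, 0.001255), "COLLEGE": (0.211783, 0.133428)}
for name, (b_i, se_i) in coef.items():
    t = b_i / se_i  # t statistic under H0: beta_i = 0
    print(name, round(t, 2))
# ROOMS t ≈ -6.07 (clearly significant); COLLEGE t ≈ 1.59 (not significant at 5%)
```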
16
3b. Do the Variables Make Sense?
• When you establish which variables you want to use, you should also establish your "a priori" assumptions regarding the expected sign of the slope coefficients.
• You do this prior to obtaining your actual model results so the actual numbers do not influence your expectations.
• By establishing these expectations, you are better able to identify surprises in your results. These surprises may lead you to additional insight into your model, or may lead you to question your results. Either is useful.
• We have already done this back on slide 6, so go back and find your original assumptions for this example.

Example—Motel Profitability

Margin = β0 + β1 Rooms + β2 Nearest + β3 Office + β4 College + β5 Income + β6 Disttwn
17
• b0 = 72.5. This is the intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.
• b1 = -0.0076. In this model, for each additional 1000 rooms within 3 miles of the La Quinta inn, the operating margin decreases on average by 7.6% (assuming the other variables are held constant).
• b2 = -1.65. In this model, for each additional mile that the nearest competitor is from the La Quinta inn, the average operating margin decreases by 1.65%. Sensible???
Example – Motel Profitability
18
• b3 = 0.02. For each additional 1000 sq ft of office space, the average increase in operating margin will be 0.02%.
• b4 = 0.21. For each additional thousand students, MARGIN increases by 0.21%.
• b5 = -0.41. For each additional $1000 increase in median household income, MARGIN decreases by 0.41%. ???
• b6 = 0.23. For each additional mile to the downtown center, MARGIN increases by 0.23% on average. ???
Example – Motel Profitability
19
Based on the t-tests, one should consider getting rid of both "College" and "Disttwn."
• The sign on "Disttwn" is a bit unexpected as well, though if you try hard you could justify it. These two indications reinforce one another. Let's get rid of it.
• The "College" variable's sign is what you would expect, and its p-value, while not below 5%, is not that high. Let's keep it for now and see what happens when we eliminate "Disttwn."

While diagnosing assumption violations is officially a separate step, it is usually best to be checking your assumptions at this stage as well.
• Recall how dramatically the model changed when we had autocorrelation. Recall that serious multicollinearity could also lead us to get rid of some variables that we might really want to keep.
Example – Motel Profitability
20
SUMMARY OUTPUT

Regression Statistics
Multiple R       0.718991
R Square         0.516948
Adjusted R Sq    0.491254
Standard Error   5.529321
Observations     100

ANOVA
            df    SS          MS         F          Significance F
Regression   5    3075.559    615.1119   20.11919   1.3555E-13
Residual    94    2873.898    30.57339
Total       99    5949.458

            Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept    75.137075     7.624565          9.854605   3.74E-16    59.998330   90.275820
ROOMS        -0.007742     0.001255         -6.167559   1.73E-08    -0.010235   -0.005250
NEAREST      -1.586923     0.633058         -2.506756   0.013901    -2.843874   -0.329971
OFFICE        0.019576     0.003418          5.727697   1.22E-07     0.012790    0.026362
COLLEGE       0.196385     0.133283          1.473443   0.143973    -0.068252    0.461021
INCOME       -0.421411     0.139833         -3.013668   0.003317    -0.699053   -0.143769

Notice that when we get rid of "Disttwn," both R² and Adjusted R² went down, but the F stat went up. This is where the "art" comes in. Despite the decline in Adjusted R², we will eliminate "Disttwn" on the basis of the size of the p-value of the t-test, the sign being wrong, and the direction of the change in the F stat. You could successfully argue to keep it as well, based on Adjusted R². Notice the p-value on "College."
21
When we got rid of "Disttwn," the p-value for "College" actually increased, and now isn't all that close to 5%. Consequently, we'll get rid of it. Once we do, we have a circumstance similar to last time regarding R², Adjusted R², and the F stat. This could go either way as well. In our case, we'll get rid of "College" for now, do a Partial F-test, and see what that suggests we do about it.

SUMMARY OUTPUT

Regression Statistics
Multiple R       0.711190
R Square         0.505791
Adjusted R Sq    0.484982
Standard Error   5.563295
Observations     100

ANOVA
            df    SS          MS         F          Significance F
Regression   4    3009.184    752.2959   24.30661   7.22973E-14
Residual    95    2940.274    30.95026
Total       99    5949.458

            Coefficients   Standard Error   t Stat       P-value    Lower 95%   Upper 95%
Intercept    77.938494     7.429077         10.491006    1.48E-17    63.189922   92.687070
ROOMS        -0.007863     0.001260         -6.238413    1.22E-08    -0.010365   -0.005360
NEAREST      -1.653650     0.635316         -2.602878    0.010726    -2.914912   -0.392389
OFFICE        0.019607     0.003439          5.701985    1.33E-07     0.012781    0.026434
INCOME       -0.399387     0.139886         -2.855082    0.005283    -0.677096   -0.121678
22
There is no particular order in which you should check the assumptions. We'll check for multicollinearity first, because it is easy to do, and you are also able to look at the correlations between each independent variable and the dependent variable at the same time.

Checking Assumption #5 is most easily done using a correlation table. Notice that all the variables from the original list are included. Among the independent variables, Office and Income show the highest absolute value of r, at -0.15. This is quite low.

Notice that we ended up getting rid of the two variables which had the lowest absolute value of r when measured against the dependent variable. This makes sense.

          MARGIN    ROOMS     NEAREST   OFFICE    COLLEGE   INCOME    DISTTWN
MARGIN     1
ROOMS     -0.4703    1
NEAREST   -0.1603   -0.0817    1
OFFICE     0.5014   -0.0935   -0.0428    1
COLLEGE    0.1230   -0.0639   -0.0712   -0.0010    1
INCOME    -0.2475   -0.0371   -0.0453   -0.1526    0.1126    1
DISTTWN    0.0923   -0.0730    0.0913   -0.0329   -0.0973   -0.0515    1
Checking Assumptions
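A correlation table like the one above can be produced with np.corrcoef. The sketch below uses only the six sample rows shown earlier, so its values will differ from the full 100-observation table:

```python
import numpy as np

# Six sample rows: MARGIN, ROOMS, NEAREST, OFFICE, COLLEGE, INCOME, DISTTWN
data = np.array([
    [55.5, 3203, 0.1, 549, 8.0, 37, 12.1],
    [33.8, 2810, 1.5, 496, 17.5, 39, 0.4],
    [49.0, 2890, 1.9, 254, 20.0, 39, 12.2],
    [31.9, 3422, 1.0, 434, 15.5, 36, 2.7],
    [57.4, 2687, 3.4, 678, 15.5, 32, 7.9],
    [49.0, 3759, 1.4, 635, 19.0, 41, 4.0],
])

corr = np.corrcoef(data, rowvar=False)  # variables are columns
print(np.round(corr, 3))  # 7x7 symmetric matrix, 1s on the diagonal
```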
23
Checking Assumptions 1 and 2, there are no obvious violations. We won’t worry about Assumption #3 as this is cross-sectional data, not time-series. We should also have taken care of Assumption #4 as we drew our graphs of each independent variable vs. the dependent variable. We only showed a few of these graphs, but at least in those cases, there did not appear to be a problem with outliers.
[Histogram of standardized residuals: bins from -2.30 to 2.06]

[Plot: standardized residuals vs. predicted values]
Checking Assumptions
24
3c. The Partial F-test (Wald test)
• How does one decide how many variables to keep in your final model? Do you keep all the variables? Some of them?
• While there is some "art" to this process, we will use the following process:

1. First, consider your individual t-test results.
   • Which variables should you keep on this basis?
   • Are there any variables that officially should be eliminated, but are close to having a small enough p-value to be retained?
   • Are there any variables you believe strongly "must" be in the model irrespective of the results of the t-test?
2. Once you have made your decisions, then conduct the "Partial F-test" to verify your results.
Assessing the Model (step #5)
25
H0: β1 = β2 = … = βi = 0
H1: At least one βi is not equal to zero.

Where:
• the βi's refer only to those variables which were eliminated from the original regression;
• SSR_f is from the full equation; SSR_r is from the reduced equation;
• MSE_f is from the full equation; k_d is the number of variables eliminated.

The test statistic is determined by the difference in SSR (full model) vs. SSR (reduced model). If there is a large difference, some of the variables you eliminated have significant explanatory power. If this is the case, you will reject H0, conclude some coefficients from the variables you eliminated are non-zero, and use the "full model."

This test will always be a one-sided upper-tail test by its nature.

Reject H0 if  F = [(SSR_f - SSR_r) / k_d] / MSE_f  >  Fα, k_d, n-k_f-1
26
The test statistic for the Partial F-test:

Partial F = [(SSR_f - SSR_r) / k_d] / MSE_f
          = [(3123.832 - 3009.184) / 2] / 30.38307
          = 57.324 / 30.38307
          = 1.8867

The ANOVA results for the full and reduced models are:

FULL        df    SS         MS         F          Significance F
Regression   6    3123.832   520.6387   17.13581   3.03382E-13
Residual    93    2825.626   30.38307
Total       99    5949.458

REDUCED     df    SS         MS         F          Significance F
Regression   4    3009.184   752.2959   24.30661   7.23E-14
Residual    95    2940.274   30.95026
Total       99    5949.458
Example – Motel Profitability
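The arithmetic above is simple enough to script. A sketch using the SSR and MSE values from the two ANOVA tables:

```python
ssr_full, ssr_reduced = 3123.832, 3009.184  # Regression SS, full vs. reduced
mse_full = 30.38307                          # MS Residual, full model
k_d = 2                                      # variables eliminated (COLLEGE, DISTTWN)

partial_f = ((ssr_full - ssr_reduced) / k_d) / mse_full
print(round(partial_f, 4))  # 1.8867
```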
27
Critical value: Fα, k_d, n-k_f-1 = F0.05, 2, 100-6-1 = 3.095

Test statistic: partial F = 1.8867

F = 1.8867 < 3.095; therefore, do not reject H0.

Conclusion: There is insufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. The coefficients of the independent variables eliminated from the regression do not appear to be different from 0, and hence those variables have no explanatory power. The reduced model appears to be the most appropriate model in this case.
Example – Motel Profitability
28
Assume you have conducted two regressions using the same data. The first regression, on the "full model," had 9 independent variables and a sample size of 200. You then run a "reduced model" after eliminating 4 of the independent variables that appeared insignificant on the basis of t-tests.

        Full Model   Reduced Model
SSR     95,532       7,978
MSE     654          13,431

Conduct a partial F-test. F0.05, 4, 190 = 2.41918485.

Conduct the same test, this time assuming the SSR from your reduced model was 92,300.
Example – Partial F test
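A sketch that runs both cases of the exercise, which you can use to check your own work (it applies the same partial F formula as above):

```python
def partial_f(ssr_full, ssr_reduced, k_d, mse_full):
    """Partial F statistic: [(SSR_f - SSR_r) / k_d] / MSE_f."""
    return ((ssr_full - ssr_reduced) / k_d) / mse_full

f_crit = 2.41918485  # F(0.05, 4, 190), given above

# Case 1: reduced-model SSR = 7,978
f1 = partial_f(95532, 7978, 4, 654)
print(round(f1, 2), f1 > f_crit)  # large F: reject H0, keep the full model

# Case 2: reduced-model SSR = 92,300
f2 = partial_f(95532, 92300, 4, 654)
print(round(f2, 2), f2 > f_crit)  # small F: do not reject H0
```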
29
Step #6: Diagnose Violations of Required Conditions
• We already did this in concert with Step #5, and that is the way you really should do it. You cannot effectively assess the model without having considered whether the assumptions have been violated.
• We separate them into steps only because both are so critical to constructing a useful regression model.
• Having to combine these critical steps is another way in which the "art" of regression analysis becomes obvious.
Example – Motel Profitability
30
Step #7: Assess the Model
We now have our final model. You should be able to do the assessment on your own at this stage.

SUMMARY OUTPUT

Regression Statistics
Multiple R       0.711190
R Square         0.505791
Adjusted R Sq    0.484982
Standard Error   5.563295
Observations     100

ANOVA
            df    SS          MS         F          Significance F
Regression   4    3009.184    752.2959   24.30661   7.22973E-14
Residual    95    2940.274    30.95026
Total       99    5949.458

            Coefficients   Standard Error   t Stat       P-value    Lower 95%   Upper 95%
Intercept    77.938494     7.429077         10.491006    1.48E-17    63.189922   92.687070
ROOMS        -0.007863     0.001260         -6.238413    1.22E-08    -0.010365   -0.005360
NEAREST      -1.653650     0.635316         -2.602878    0.010726    -2.914912   -0.392389
OFFICE        0.019607     0.003439          5.701985    1.33E-07     0.012781    0.026434
INCOME       -0.399387     0.139886         -2.855082    0.005283    -0.677096   -0.121678
Example – Motel Profitability
31
Example – Motel Profitability
Step #8: Use the Model

• Use the model to predict the profit margin of three possible locations.
• What are your expectations for profit margins in each location?
• Where should we recommend La Quinta to locate the next motel?
• What seem to be the deciding factors in this case?

Characteristics          Athens (OH)   Bloomington (IN)   Miami (OH)
Rooms                    2672          2,500              2,300
Competitor Distance      1.3           1.2                0.5
Office Space ('000s)     952           1430               655
Students ('000s)         17            21                 15
Income ('000s)           35            37                 33.5
Dist to Downtown         3.4           4.5                1.4
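The final-model coefficients can be applied to the three candidate sites. A sketch (note that Students and Distance to Downtown go unused, since COLLEGE and DISTTWN were dropped from the final model; the results are point estimates only):

```python
# Final (reduced) model coefficients from the Step #7 printout.
b0, b_rooms, b_nearest, b_office, b_income = (
    77.93849422, -0.007862522, -1.653650492, 0.019607492, -0.399387121)

sites = {  # Rooms, Nearest (mi), Office ('000s sq ft), Income ($'000s)
    "Athens (OH)":      (2672, 1.3, 952, 35.0),
    "Bloomington (IN)": (2500, 1.2, 1430, 37.0),
    "Miami (OH)":       (2300, 0.5, 655, 33.5),
}

pred = {}
for name, (rooms, nearest, office, income) in sites.items():
    pred[name] = (b0 + b_rooms * rooms + b_nearest * nearest
                  + b_office * office + b_income * income)
    print(name, round(pred[name], 1))

best = max(pred, key=pred.get)
print("Recommended site:", best)
```

With these inputs, Bloomington comes out highest, driven largely by its office space; this is the kind of "deciding factor" the slide's questions are asking you to identify.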