1
5. Multiple Regression II
ECON 251
Research Methods
2
The Regression Modeling Process

Multiple regression also introduces complexity in the modeling process itself.
• How does one decide which variables should be considered?
• How many variables should be initially considered?
• How many variables should there be in your final model?
• What does one do if the data cannot be found on the exact variable of interest?

While the answers to many of these questions involve as much art as science, there are some steps that can guide you along your way.
In this section, we will also introduce a few more tools to help you with some of these decisions.
3
The Modeling Process—Step by Step

1. Develop a model that has a sound basis.
   ― Theoretical and practical inputs into model formation
   ― Working group of experts for a brainstorming session
   ― Literature review on factors influencing the variable of interest
2. Gather data for the variables in the model.
   ― Gather data for the dependent and independent variables.
   ― If data cannot be found for the exact variable, use a "proxy."
   ― Example: You believe sales of your product follow GDP growth, but you want a model of monthly data, and GDP figures are quarterly. What do you do?
3. Draw the scatter diagram to determine whether a linear model (or another form) appears to be appropriate.
4. Estimate the model coefficients and statistics using statistical computer software.
4
The Modeling Process—Step by Step

5. Assess the model fit and usefulness using the model statistics.
   • Use the three-step process we developed with simple linear regression.
   • Do the variables make sense? (significance, signs)
6. Diagnose violations of required conditions. Try to remedy problems when identified.
7. Assess the model fit and usefulness using the model statistics.
   • Notice the iterative nature of the process.
8. If the model passes the assessment tests, use it to:
   • Predict the value of the dependent variable
   • Provide interval estimates for these predictions
   • Provide insight into the impact of each independent variable on the dependent variable.

Remember: Statistics informs judgment; it does not replace it. Use your common sense when developing, finalizing, and employing a model!
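Step 4 (estimating the coefficients) is normally done with statistical software, but the underlying least-squares computation is short. A minimal sketch in Python using NumPy on made-up data (the variables and true coefficients below are illustrative, not from any example in this course):

```python
import numpy as np

# Illustrative data (made up): 20 observations, 2 independent variables.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 20)
x2 = rng.uniform(0, 5, 20)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 0.5, 20)

# Design matrix: a column of ones for the intercept, then the x's.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares coefficient estimates, b = (X'X)^(-1) X'y.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # roughly [3.0, 1.5, -2.0]
```

Any statistics package (Excel's Data Analysis ToolPak, R, Stata) performs this same computation and adds the standard errors and test statistics used in Steps 5-7.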
5
Example – Motel Profitability
La Quinta Motor Inns is planning an expansion. Management wishes to predict which sites are likely to be profitable.

Step #1: Develop a model with a sound basis
• Several predictors of profitability can be identified, including:
  ― Competition
  ― Market awareness
  ― Demand generators
  ― Demographics
  ― Physical quality
6
Profitability is modeled as a function of the five groups of factors:

• Competition ― Rooms: number of hotel/motel rooms within 3 miles.
• Market awareness ― Nearest: distance in miles to the nearest competitor of the La Quinta inn.
• Demand generators ― Office space: in '000s of sq ft within 3 miles; College enrollment: in '000s of students within 3 miles.
• Demographics ― Income: median household income in '000s.
• Physical ― Disttown: distance to downtown in miles.

At this stage, you should also assign your "a priori" expectations of the sign of each coefficient for each independent variable. We'll use this information when we "assess" the model.
Example – Motel Profitability
7
Step #2: Gather Data
• Data were collected from 100 randomly selected inns that belong to La Quinta, and the following suggested model was run:

Margin = β0 + β1 Rooms + β2 Nearest + β3 Office + β4 College + β5 Income + β6 Disttwn + ε

First rows of the data:

INN   MARGIN   ROOMS   NEAREST   OFFICE   COLLEGE   INCOME   DISTTWN
1     55.5     3203    0.1       549      8         37       12.1
2     33.8     2810    1.5       496      17.5      39       0.4
3     49       2890    1.9       254      20        39       12.2
4     31.9     3422    1         434      15.5      36       2.7
5     57.4     2687    3.4       678      15.5      32       7.9
6     49       3759    1.4       635      19        41       4
Example – Motel Profitability
8
Step #3: Draw Scatter Diagrams

[Scatter diagram] Rooms (vertical axis) vs. Margin (horizontal): trend line y = -27.179x + 4228.4, R² = 0.2212

[Scatter diagram] Nearest (vertical axis) vs. Margin (horizontal): trend line y = -0.0183x + 2.8274, R² = 0.0257
9
Regression Statistics
Multiple R       0.724611
R Square         0.525062
Adjusted R Sq    0.494420
Standard Error   5.512084
Observations     100

ANOVA
            df    SS         MS         F          Significance F
Regression   6    3123.832   520.6387   17.13581   3.03E-13
Residual    93    2825.626   30.38307
Total       99    5949.458

            Coefficients   Std. Error   t Stat      P-value
Intercept    72.45461      7.893104      9.179483   1.11E-14
ROOMS        -0.00762      0.001255     -6.06871    2.77E-08
NEAREST      -1.64624      0.632837     -2.60136    0.010803
OFFICE        0.019766     0.003410      5.795594   9.24E-08
COLLEGE       0.211783     0.133428      1.587246   0.115851
INCOME       -0.41312      0.139552     -2.96034    0.003899
DISTTWN       0.225258     0.178709      1.260475   0.210651
Step #4: Estimate the Model
This is the sample regression equation (sometimes called the prediction equation):

MARGIN = 72.455 - 0.008 ROOMS - 1.646 NEAREST + 0.02 OFFICE + 0.212 COLLEGE - 0.413 INCOME + 0.225 DISTTWN
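As a quick check of the prediction equation, the fitted margin for inn 1 in the data table can be computed directly from the printout coefficients. A sketch (small differences from software output come from coefficient rounding):

```python
# Coefficients from the full-model printout.
b = {"intercept": 72.45461, "rooms": -0.00762, "nearest": -1.64624,
     "office": 0.019766, "college": 0.211783, "income": -0.41312,
     "disttwn": 0.225258}

# Inn 1 from the data table: MARGIN 55.5, ROOMS 3203, NEAREST 0.1,
# OFFICE 549, COLLEGE 8, INCOME 37, DISTTWN 12.1.
fitted = (b["intercept"] + b["rooms"] * 3203 + b["nearest"] * 0.1
          + b["office"] * 549 + b["college"] * 8
          + b["income"] * 37 + b["disttwn"] * 12.1)
residual = 55.5 - fitted
print(round(fitted, 2), round(residual, 2))  # fitted ≈ 47.87, residual ≈ 7.63
```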
Example – Motel Profitability
10
Step #5: Assess the Model
• We will add a number of steps and sub-steps to our model assessment process when using multiple regression. The assessment process becomes:

1a. R² (coefficient of determination)
1b. Adjusted R²
2. F-test for overall validity of the model
3. t-test for each slope
   ― using b_i (the estimate of the slope)
   ― Partial F-test to verify elimination of some independent variables
Example – Motel Profitability
11
1a. Coefficient of determination
• The definition is

    R² = 1 - SSE/SST

• From the printout, R² = 0.5251.
• 52.51% of the variation in the measure of profitability is explained by the linear regression model formulated above.
• Notice that we are not using SSR/SST. That version of the formula would still work for now, but it will not work once we introduce "Adjusted R²" . . .
Assessing the Model (step #5)
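With the SS column of the ANOVA printout in hand, R² follows directly from this definition. A small sketch using the Residual and Total SS from the La Quinta printout:

```python
sse = 2825.626   # Residual SS from the ANOVA table
sst = 5949.458   # Total SS from the ANOVA table
r_sq = 1 - sse / sst
print(round(r_sq, 4))  # 0.5251, matching the printout's R Square
```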
12
1b. The "Adjusted" Coefficient of Determination is defined as:

    Adjusted R² = 1 - [SSE/(n - k - 1)] / [SST/(n - 1)]

• As you add additional independent variables to your model, what happens to SST, SSR, and SSE? What happens to R²? To Adjusted R²?
• If all you cared about was a model with a high R², you might be tempted to increase the number of independent variables almost irrespective of the amount of significant explanatory power each added. Adjusted R² penalizes you a small amount for each additional independent variable you add. A new variable must contribute significantly to explaining SST before Adjusted R² will go up.
• From the printout, Adjusted R² = 0.4944, or 49.44%.
• 49.44% of the variation in the measure of profitability is explained by the linear regression model formulated above, after "adjusting for the degrees of freedom," or the "number of independent variables."
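The adjustment is just a degrees-of-freedom correction to the two sums of squares. A sketch reproducing the printout's value:

```python
sse, sst = 2825.626, 5949.458
n, k = 100, 6  # 100 observations, 6 independent variables
adj_r_sq = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(round(adj_r_sq, 4))  # 0.4944, matching the printout's Adjusted R Square
```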
13
2. The F-Test for Overall Validity of the Model
• Recall that in conducting this test, we are posing the question: Is there at least one independent variable linearly related to the dependent variable?
• To answer the question, we test the hypothesis:

    H0: β1 = β2 = … = βk = 0
    H1: At least one βi is not equal to zero.

• If at least one βi is not equal to zero, the model is valid.
• The F test:
  ― Construct the F statistic:  F = MSR / MSE
  ― Reject H0 if F > Fα, k, n-k-1
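The F statistic is just the ratio of the two mean squares in the ANOVA table. A sketch with the La Quinta printout values (the critical value 2.197 is read from an F table for α = 0.05 with 6 and 93 degrees of freedom):

```python
msr = 520.6387  # MS Regression from the ANOVA table
mse = 30.38307  # MS Residual from the ANOVA table
f_stat = msr / mse
f_crit = 2.197  # F(0.05, 6, 93) from a table
print(round(f_stat, 2), f_stat > f_crit)  # 17.14 True -> reject H0
```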
14
Excel provides the following ANOVA results:

ANOVA
            df    SS         MS         F          Significance F
Regression   6    3123.832   520.6387   17.13581   3.03382E-13
Residual    93    2825.626   30.38307
Total       99    5949.458

Fα, k, n-k-1 = F0.05, 6, 100-6-1 = 2.197
F = 17.14 > 2.197

Also, the p-value (Significance F) = 3.03382(10)^-13. Clearly, α = 0.05 > 3.03382(10)^-13, and the null hypothesis is rejected.

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the βi is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.
15
3a. Testing the coefficients
• The hypothesis for each βi:

    H0: βi = 0
    H1: βi ≠ 0

• Test statistic:

    t = (b_i - βi) / s_bi,  d.f. = n - k - 1

• Example—Motel Profitability:

            Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept    72.45461      7.893104          9.179483   1.11E-14    56.78049    88.12874
ROOMS        -0.00762      0.001255         -6.06871    2.77E-08    -0.01011    -0.00513
NEAREST      -1.64624      0.632837         -2.60136    0.010803    -2.90292    -0.38955
OFFICE        0.019766     0.003410          5.795594   9.24E-08     0.012993    0.026538
COLLEGE       0.211783     0.133428          1.587246   0.115851    -0.053178    0.476744
INCOME       -0.41312      0.139552         -2.96034    0.003899    -0.690245   -0.136000
DISTTWN       0.225258     0.178709          1.260475   0.210651    -0.129622    0.580138
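With βi = 0 under H0, each t statistic in the printout is just the coefficient divided by its standard error. A sketch recomputing two of them from the rounded printout values (tiny differences from the printed t stats are due to rounding):

```python
coef = {"ROOMS": (-0.00762, 0.001255), "COLLEGE": (0.211783, 0.133428)}
for name, (b_i, se_i) in coef.items():
    t = b_i / se_i  # t statistic under H0: beta_i = 0
    print(name, round(t, 2))
# ROOMS t ≈ -6.07 (clearly significant); COLLEGE t ≈ 1.59 (not significant at 5%)
```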
16
3b. Do the Variables Make Sense?
• When you establish which variables you want to use, you should also establish your "a priori" assumptions regarding the expected sign of the slope coefficients.
• You do this prior to obtaining your actual model results so the actual numbers do not influence your expectations.
• By establishing these expectations, you are better able to identify surprises in your results. These surprises may lead you to additional insight into your model, or may lead you to question your results. Either is useful.
• We have already done this back on slide 6, so go back and find your original assumptions for this example.

Example—Motel Profitability

Margin = β0 + β1 Rooms + β2 Nearest + β3 Office + β4 College + β5 Income + β6 Disttwn
17
• b0 = 72.5. This is the intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.
• b1 = -0.0076. In this model, for each additional 1000 rooms within 3 miles of the La Quinta inn, the operating margin decreases on average by 7.6% (assuming the other variables are held constant).
• b2 = -1.65. In this model, for each additional mile that the nearest competitor is from the La Quinta inn, the average operating margin decreases by 1.65%. Sensible???
Example – Motel Profitability
18
• b3 = 0.02. For each additional 1000 sq ft of office space, the average increase in operating margin will be 0.02%.
• b4 = 0.21. For each additional thousand students, MARGIN increases by 0.21%.
• b5 = -0.41. For each additional $1000 increase in median household income, MARGIN decreases by 0.41%. ???
• b6 = 0.23. For each additional mile to the downtown center, MARGIN increases by 0.23% on average. ???
Example – Motel Profitability
19
Based on the t-tests, one should consider getting rid of both "College" and "Disttwn."
• The sign on "Disttwn" is a bit unexpected as well, though if you try hard you could justify it. These two indications reinforce one another. Let's get rid of it.
• The "College" variable's sign is what you would expect, and its p-value, while not below 5%, is not that high. Let's keep it for now and see what happens when we eliminate "Disttwn."

While diagnosing assumption violations is officially a separate step, it is usually best to be checking your assumptions at this stage as well.
• Recall how dramatically the model changed when we had autocorrelation. Recall that serious multicollinearity could also lead us to get rid of some variables that we might really want to keep.
Example – Motel Profitability
20
SUMMARY OUTPUT

Regression Statistics
Multiple R       0.718991
R Square         0.516948
Adjusted R Sq    0.491254
Standard Error   5.529321
Observations     100

ANOVA
            df    SS          MS         F          Significance F
Regression   5    3075.559    615.1119   20.11919   1.3555E-13
Residual    94    2873.898    30.57339
Total       99    5949.458

            Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept    75.137075     7.624565          9.854605   3.74E-16    59.998330   90.275820
ROOMS        -0.007742     0.001255         -6.167559   1.73E-08    -0.010235   -0.005250
NEAREST      -1.586923     0.633058         -2.506756   0.013901    -2.843874   -0.329971
OFFICE        0.019576     0.003418          5.727697   1.22E-07     0.012790    0.026362
COLLEGE       0.196385     0.133283          1.473443   0.143973    -0.068252    0.461021
INCOME       -0.421411     0.139833         -3.013668   0.003317    -0.699053   -0.143769

Notice that when we get rid of "Disttwn," both R² and Adjusted R² went down, but the F stat went up. This is where the "art" comes in. Despite the decline in Adjusted R², we will eliminate "Disttwn" on the basis of the size of the p-value of the t-test, the sign being wrong, and the direction of the change in the F stat. You could successfully argue to keep it as well, based on Adjusted R². Notice the p-value on "College."
21
When we got rid of "Disttwn," the p-value for "College" actually increased, and now isn't all that close to 5%. Consequently, we'll get rid of it. Once we do, we have a circumstance similar to last time regarding R², Adjusted R², and the F stat. This could go either way as well. In our case, we'll get rid of "College" for now, do a Partial F-test, and see what that suggests we do about it.

SUMMARY OUTPUT

Regression Statistics
Multiple R       0.711190
R Square         0.505791
Adjusted R Sq    0.484982
Standard Error   5.563295
Observations     100

ANOVA
            df    SS          MS         F          Significance F
Regression   4    3009.184    752.2959   24.30661   7.22973E-14
Residual    95    2940.274    30.95026
Total       99    5949.458

            Coefficients   Standard Error   t Stat       P-value    Lower 95%   Upper 95%
Intercept    77.938494     7.429077         10.491006    1.48E-17    63.189922   92.687070
ROOMS        -0.007863     0.001260         -6.238413    1.22E-08    -0.010365   -0.005360
NEAREST      -1.653650     0.635316         -2.602878    0.010726    -2.914912   -0.392389
OFFICE        0.019607     0.003439          5.701985    1.33E-07     0.012781    0.026434
INCOME       -0.399387     0.139886         -2.855082    0.005283    -0.677096   -0.121678
22
There is no particular order in which you should check the assumptions. We'll check for multicollinearity first, because it is easy to do, and you are also able to look at the correlations between each independent variable and the dependent variable at the same time.

Checking Assumption #5 is most easily done using a correlation table. Notice that all the variables from the original list are included. Among the independent variables, Office and Income show the highest absolute value of r, at -0.15. This is quite low.

Notice that we ended up getting rid of the two variables which had the lowest absolute value of r when measured against the dependent variable. This makes sense.

          MARGIN    ROOMS     NEAREST   OFFICE    COLLEGE   INCOME    DISTTWN
MARGIN     1
ROOMS     -0.4703    1
NEAREST   -0.1603   -0.0817    1
OFFICE     0.5014   -0.0935   -0.0428    1
COLLEGE    0.1230   -0.0639   -0.0712   -0.0010    1
INCOME    -0.2475   -0.0371   -0.0453   -0.1526    0.1126    1
DISTTWN    0.0923   -0.0730    0.0913   -0.0329   -0.0973   -0.0515    1
Checking Assumptions
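A correlation table like the one above can be produced with np.corrcoef. The sketch below uses only the six sample rows shown earlier, so its values will differ from the full 100-observation table:

```python
import numpy as np

# Six sample rows: MARGIN, ROOMS, NEAREST, OFFICE, COLLEGE, INCOME, DISTTWN
data = np.array([
    [55.5, 3203, 0.1, 549, 8.0, 37, 12.1],
    [33.8, 2810, 1.5, 496, 17.5, 39, 0.4],
    [49.0, 2890, 1.9, 254, 20.0, 39, 12.2],
    [31.9, 3422, 1.0, 434, 15.5, 36, 2.7],
    [57.4, 2687, 3.4, 678, 15.5, 32, 7.9],
    [49.0, 3759, 1.4, 635, 19.0, 41, 4.0],
])

corr = np.corrcoef(data, rowvar=False)  # variables are columns
print(np.round(corr, 3))  # 7x7 symmetric matrix, 1s on the diagonal
```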
23
Checking Assumptions 1 and 2, there are no obvious violations. We won’t worry about Assumption #3 as this is cross-sectional data, not time-series. We should also have taken care of Assumption #4 as we drew our graphs of each independent variable vs. the dependent variable. We only showed a few of these graphs, but at least in those cases, there did not appear to be a problem with outliers.
[Histogram of standardized residuals: bins from -2.30 to 2.06]

[Plot: standardized residuals vs. predicted values]
Checking Assumptions
24
3c. The Partial F-test (Wald test)
• How does one decide how many variables to keep in your final model? Do you keep all the variables? Some of them?
• While there is some "art" to this process, we will use the following process:

1. First, consider your individual t-test results.
   • Which variables should you keep on this basis?
   • Are there any variables that officially should be eliminated, but are close to having a small enough p-value to be retained?
   • Are there any variables you believe strongly "must" be in the model irrespective of the results of the t-test?
2. Once you have made your decisions, then conduct the "Partial F-test" to verify your results.
Assessing the Model (step #5)
25
H0: β1 = β2 = … = βi = 0
H1: At least one βi is not equal to zero.

Where:
• the βi's refer only to those variables which were eliminated from the original regression;
• SSR_f is from the full equation; SSR_r is from the reduced equation;
• MSE_f is from the full equation; k_d is the number of variables eliminated.

The test statistic is determined by the difference in SSR (full model) vs. SSR (reduced model). If there is a large difference, some of the variables you eliminated have significant explanatory power. If this is the case, you will reject H0, conclude some coefficients from the variables you eliminated are non-zero, and use the "full model."

This test will always be a one-sided upper-tail test by its nature.

Reject H0 if  F = [(SSR_f - SSR_r) / k_d] / MSE_f  >  Fα, k_d, n-k_f-1
26
The test statistic for the Partial F-test:

Partial F = [(SSR_f - SSR_r) / k_d] / MSE_f
          = [(3123.832 - 3009.184) / 2] / 30.38307
          = 57.324 / 30.38307
          = 1.8867

The ANOVA results for the full and reduced models are:

FULL        df    SS         MS         F          Significance F
Regression   6    3123.832   520.6387   17.13581   3.03382E-13
Residual    93    2825.626   30.38307
Total       99    5949.458

REDUCED     df    SS         MS         F          Significance F
Regression   4    3009.184   752.2959   24.30661   7.23E-14
Residual    95    2940.274   30.95026
Total       99    5949.458
Example – Motel Profitability
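The arithmetic above is simple enough to script. A sketch using the SSR and MSE values from the two ANOVA tables:

```python
ssr_full, ssr_reduced = 3123.832, 3009.184  # Regression SS, full vs. reduced
mse_full = 30.38307                          # MS Residual, full model
k_d = 2                                      # variables eliminated (COLLEGE, DISTTWN)

partial_f = ((ssr_full - ssr_reduced) / k_d) / mse_full
print(round(partial_f, 4))  # 1.8867
```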
27
Critical value: Fα, k_d, n-k_f-1 = F0.05, 2, 100-6-1 = 3.095

Test statistic: partial F = 1.8867

F = 1.8867 < 3.095; therefore, do not reject H0.

Conclusion: There is insufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. The coefficients of the independent variables eliminated from the regression do not appear to be different from 0, and hence those variables have no explanatory power. The reduced model appears to be the most appropriate model in this case.
Example – Motel Profitability
28
Assume you have conducted two regressions using the same data. The first regression, on the "full model," had 9 independent variables and a sample size of 200. You then run a "reduced model" after eliminating 4 of the independent variables that appeared insignificant on the basis of t-tests.

        Full Model   Reduced Model
SSR     95,532       7,978
MSE     654          13,431

Conduct a partial F-test. F0.05, 4, 190 = 2.41918485.

Conduct the same test, this time assuming the SSR from your reduced model was 92,300.
Example – Partial F test
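A sketch that runs both cases of the exercise, which you can use to check your own work (it applies the same partial F formula as above):

```python
def partial_f(ssr_full, ssr_reduced, k_d, mse_full):
    """Partial F statistic: [(SSR_f - SSR_r) / k_d] / MSE_f."""
    return ((ssr_full - ssr_reduced) / k_d) / mse_full

f_crit = 2.41918485  # F(0.05, 4, 190), given above

# Case 1: reduced-model SSR = 7,978
f1 = partial_f(95532, 7978, 4, 654)
print(round(f1, 2), f1 > f_crit)  # large F: reject H0, keep the full model

# Case 2: reduced-model SSR = 92,300
f2 = partial_f(95532, 92300, 4, 654)
print(round(f2, 2), f2 > f_crit)  # small F: do not reject H0
```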
29
Step #6: Diagnose Violations of Required Conditions
• We already did this in concert with Step #5, and that is the way you really should do it. You cannot effectively assess the model without having considered whether the assumptions have been violated.
• We separate them into steps only because both are so critical to constructing a useful regression model.
• Having to combine these critical steps is another way in which the "art" of regression analysis becomes obvious.
Example – Motel Profitability
30
Step #7: Assess the Model
We now have our final model. You should be able to do the assessment on your own at this stage.

SUMMARY OUTPUT

Regression Statistics
Multiple R       0.711190
R Square         0.505791
Adjusted R Sq    0.484982
Standard Error   5.563295
Observations     100

ANOVA
            df    SS          MS         F          Significance F
Regression   4    3009.184    752.2959   24.30661   7.22973E-14
Residual    95    2940.274    30.95026
Total       99    5949.458

            Coefficients   Standard Error   t Stat       P-value    Lower 95%   Upper 95%
Intercept    77.938494     7.429077         10.491006    1.48E-17    63.189922   92.687070
ROOMS        -0.007863     0.001260         -6.238413    1.22E-08    -0.010365   -0.005360
NEAREST      -1.653650     0.635316         -2.602878    0.010726    -2.914912   -0.392389
OFFICE        0.019607     0.003439          5.701985    1.33E-07     0.012781    0.026434
INCOME       -0.399387     0.139886         -2.855082    0.005283    -0.677096   -0.121678
Example – Motel Profitability
31
Example – Motel Profitability
Step #8: Use the Model

• Use the model to predict the profit margin of three possible locations.
• What are your expectations for profit margins in each location?
• Where should we recommend La Quinta to locate the next motel?
• What seem to be the deciding factors in this case?

Characteristics          Athens (OH)   Bloomington (IN)   Miami (OH)
Rooms                    2672          2,500              2,300
Competitor Distance      1.3           1.2                0.5
Office Space ('000s)     952           1430               655
Students ('000s)         17            21                 15
Income ('000s)           35            37                 33.5
Dist to Downtown         3.4           4.5                1.4
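The final-model coefficients can be applied to the three candidate sites. A sketch (note that Students and Distance to Downtown go unused, since COLLEGE and DISTTWN were dropped from the final model; the results are point estimates only):

```python
# Final (reduced) model coefficients from the Step #7 printout.
b0, b_rooms, b_nearest, b_office, b_income = (
    77.93849422, -0.007862522, -1.653650492, 0.019607492, -0.399387121)

sites = {  # Rooms, Nearest (mi), Office ('000s sq ft), Income ($'000s)
    "Athens (OH)":      (2672, 1.3, 952, 35.0),
    "Bloomington (IN)": (2500, 1.2, 1430, 37.0),
    "Miami (OH)":       (2300, 0.5, 655, 33.5),
}

pred = {}
for name, (rooms, nearest, office, income) in sites.items():
    pred[name] = (b0 + b_rooms * rooms + b_nearest * nearest
                  + b_office * office + b_income * income)
    print(name, round(pred[name], 1))

best = max(pred, key=pred.get)
print("Recommended site:", best)
```

With these inputs, Bloomington comes out highest, driven largely by its office space; this is the kind of "deciding factor" the slide's questions are asking you to identify.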