III. Model Building

Model building: writing a model that will provide a good fit to a set of data & that will give good estimates of the mean value of y and good predictions of y for given values of the explanatory variables.

Why is model building important, both in statistical analysis & in analysis in general?

Theory & empirical research

“A social science theory is a reasoned and precise speculation about the answer to a research question, including a statement about why the proposed answer is correct.”

“Theories usually imply several more specific descriptive or causal hypotheses” (King et al., page 19).

A model is “a simplification of, and approximation to, some aspect of the world.”

“Models are never literally ‘true’ or ‘false,’ although good models abstract only the ‘right’ features of the reality they represent” (King et al., page 49).

Remember: social construction of reality (including alleged causal relations); skepticism; rival hypotheses; & contradictory evidence.

What kinds of evidence (or perspectives) would force me to revise or jettison the model?

Three approaches to model building

Begin with a linear model for its simplicity & as a rough approximation of the y/x relationships.

Begin with a curvilinear model to capture the complexities of the y/x relationships.

Begin with a model that incorporates linearity &/or curvilinearity in y/x relationships according to theory & observation.

The predominant approach used to be to start with a simple model & to test it against progressively more complex models.

This approach suffers, however, from the problem associated with omitted variables in the simpler models.

Increasingly common, then, is the approach of starting with more complex models & testing against simpler models (Greene, Econometric Analysis, pages 151-52).

The point of departure for model-building is trying to grasp how the outcome variable y varies as the levels of an explanatory variable change.

We have to know how to write a mathematical equation to model this relationship.

In what follows, let’s pretend that we’ve already done careful univariate & bivariate exploratory data analysis via graphs & numerical summaries (although, on the other hand, the exercise requires here & there that we didn’t do such careful groundwork…).

Suppose we want to model a person’s performance on an exam, y, as a function of a single explanatory variable, x, the person’s amount of study time.

It may be that the person’s score, y, increases in a straight line as the amount of study time increases from 1 to 6 hours.

If this were the entire range of x values used to fit the equation, a linear model would be appropriate:

$$y = \beta_0 + \beta_1 x + e$$

What, though, if the range of sample hours increased to 10 hours or more: would a straight-line model continue to be satisfactory?

Quite possibly, the increase in exam score for a unit increase in study time would decrease, causing some amount of curvature in the y/x relationship.

What kind of model would be appropriate? A second-order polynomial, called a quadratic:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + e$$

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + e$$

$\beta_0$: value of y when x equals 0 (the y-intercept); shifts the parabola up or down

$\beta_1$: the slope of y on x when x=0 (which we don’t really care about & won’t interpret)

$\beta_2$: negative, degree of downward curvature of the parabola; positive, degree of upward curvature
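Why the first-order coefficient is only the slope at x=0 can be seen by differentiating the quadratic. The following is a short worked derivation using the same symbols as the model above; the turning-point symbol $x^{*}$ is introduced here for illustration only.

```latex
\frac{d\,E(y)}{dx} = \beta_1 + 2\beta_2 x
\qquad\Longrightarrow\qquad
\left.\frac{d\,E(y)}{dx}\right|_{x=0} = \beta_1 ,
\qquad
x^{*} = -\frac{\beta_1}{2\beta_2}
\quad (\text{where the slope is } 0,\ \beta_2 \neq 0)
```

The slope therefore changes by $2\beta_2$ for each one-unit increase in x, which is why the sign of $\beta_2$ determines whether the curve bends downward or upward.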

Recall that the model is valid only for the range of x values used to estimate the model.

What does this imply about predictions for values that exceed this range?

Testing a second-order equation

$$H_0: \beta_2 = 0$$

$$H_a: \beta_2 \neq 0$$

If $\beta_2$ tests significant, do not interpret $\beta_1$ (which represents y’s slope on x1 when x1=0).

Let’s continue with this model-building strategy but change the substantive topic.

We’ll focus on the relationship of average hourly wage to a series of explanatory variables (e.g., education & job tenure with the same employer).

Let’s explore the relationship.

[Figure: scatterplot of average hourly earnings against years of education, with quadratic fitted values]

Example:

. use WAGE1, clear

. scatter wage educ || qfit wage educ

wage = -0.905 + 0.541 educ
r² = 0.165, RMSE = 3.378, n = 526

[Figure: scatterplot of average hourly earnings against years of education, with the fitted regression line]

. sparl wage educ

wage = 5.408 - 0.607 educ + 0.049 educ^2
r² = 0.201, RMSE = 3.307, n = 526

[Figure: scatterplot of average hourly earnings against years of education, with the fitted quadratic curve]

. sparl wage educ, quad

. sparl wage educ, logx

wage = -7.460 + 5.330 log(educ)
r² = 0.128, RMSE = 3.455, n = 524

[Figure: scatterplot of average hourly earnings against years of education, with the fitted log-x curve]

. gen educ2=educ^2

. su educ educ2

. reg wage educ educ2

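      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  2,   523) =   65.79
       Model |  1439.40602     2   719.70301           Prob > F      =  0.0000
    Residual |  5721.00827   523  10.9388303           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  7160.41429   525  13.6388844           Root MSE      =  3.3074

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.6074999   .2414904    -2.52   0.012     -1.08191   -.1330897
       educ2 |   .0490724   .0100718     4.87   0.000     .0292862    .0688587
       _cons |   5.407688   1.458863     3.71   0.000     2.541737    8.273639
------------------------------------------------------------------------------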

If the second-order term tests significant, we don’t interpret the first-order term.

Why not?

Let’s figure out what the second-order term means in this model.

What do the following graphs say about the relationship of wage to years of education?

[Figure: quadratic fitted values of wage with 95% CI; x-axis: years of education]

. twoway qfitci wage educ

[Figure: scatterplot of average hourly earnings against years of education, with median bands]

. scatter wage educ || mband wage educ, ba(8)

[Figure: lowess smoother of average hourly earnings on years of education, bandwidth = .2]

. lowess wage educ, bwidth(.2)

Median Band Regression & Lowess Smoothing

Median band regression (twoway mband y x) & lowess smoothing (lowess y x) are two very helpful tools for detecting (1) how a model fits or doesn’t fit particular segments of the x-values (e.g., poorer to richer persons) & (2) thus non-linearity.

Hence they’re really useful at all stages of exploratory data analysis.

Another option: locpoly y x1
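To make the idea of median bands concrete, here is a minimal Python sketch. It is a simplified illustration, not Stata's implementation: it assumes equal-width bands along x, and the function name `median_bands` and the toy data are hypothetical.

```python
from statistics import median

def median_bands(x, y, n_bands):
    """Split the x-range into equal-width bands and return the
    (median x, median y) pair for each non-empty band."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bands
    bands = [[] for _ in range(n_bands)]
    for xi, yi in zip(x, y):
        # Assign each point to a band; the maximum x goes in the last band.
        idx = min(int((xi - lo) / width), n_bands - 1) if width > 0 else 0
        bands[idx].append((xi, yi))
    return [
        (median(p[0] for p in band), median(p[1] for p in band))
        for band in bands
        if band
    ]

# Toy data (hypothetical): y curves upward in x.
TOY_X = list(range(10))
TOY_Y = [v ** 2 for v in TOY_X]
CROSS_MEDIANS = median_bands(TOY_X, TOY_Y, 2)
print(CROSS_MEDIANS)
```

Connecting the cross medians with line segments gives a curve that is resistant to outliers; when the segments get steeper or flatter across bands, that is a signal of non-linearity.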

What did the graphs say about the relationship of wage to years of education?

Let’s answer this question more precisely by predicting the direction & magnitude of the wage/education relationship at specific levels of education, identified via ‘su x, detail’ and/or our knowledge of the issue:

. lincom educ*9 + educ2*81

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -1.492631   1.388045    -1.08   0.283    -4.219459    1.234197

. lincom educ*12 + educ2*144

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.2235668   1.514447    -0.15   0.883    -3.198713     2.75158

. lincom educ*16 + educ2*256

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   2.842547   1.456761     1.95   0.052    -.0192752     5.70437

. su educ, d
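The lincom point estimates can be reproduced by hand from the quadratic model's coefficients. The sketch below, in Python, copies the two coefficients from the `reg wage educ educ2` output; the function names are hypothetical, and standard errors are not reproduced because they would need the coefficient covariance matrix, which the output does not show.

```python
# Coefficients copied from the `reg wage educ educ2` output.
B_EDUC = -0.6074999
B_EDUC2 = 0.0490724

def change_from_zero(educ):
    """b1*educ + b2*educ^2: the combination each lincom command estimates."""
    return B_EDUC * educ + B_EDUC2 * educ ** 2

def slope_at(educ):
    """Derivative of the fitted quadratic with respect to educ."""
    return B_EDUC + 2 * B_EDUC2 * educ

# Education level at which the fitted curve stops falling and starts rising.
TURNING_POINT = -B_EDUC / (2 * B_EDUC2)

for years in (9, 12, 16):
    print(years, round(change_from_zero(years), 4), round(slope_at(years), 4))
print("turning point:", round(TURNING_POINT, 2))
```

The turning point falls at roughly 6 years of education, below the levels examined here, so the fitted slope is positive & increasing at 9, 12 & 16 years.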

[Figure: quadratic fitted values of wage with 95% CI; x-axis: years of education]

How do the predictions relate to the graph?

Don’t get hung up with every segment of the curve.

The curve is only an approximation. Thus it may not fit the data well within any particular range (especially where there are few observations).

Remember, moreover, that Adj R2 was just .198 for this model.

Obviously there are other relevant explanatory variables. Not only do we need to identify them, but we also need to ask: are they independent & linear? independent & curvilinear? or are they interactional?
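The adjusted R² quoted for the quadratic model can be checked from the sums of squares in its regression output. This is a small Python sketch using the standard adjusted-R² formula; the function name is hypothetical.

```python
def adjusted_r2(ss_residual, ss_total, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    r2 = 1 - ss_residual / ss_total
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Residual SS, total SS, n & number of predictors (educ, educ2)
# copied from the quadratic model's output.
ADJ_R2_QUADRATIC = adjusted_r2(5721.00827, 7160.41429, 526, 2)
print(round(ADJ_R2_QUADRATIC, 4))
```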

Interaction Effects

Interaction: the effect of a 1-unit change in one explanatory variable depends on the level of another explanatory variable.

With interaction, both the y-intercept & the regression slope change; i.e. the regression lines are not parallel.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$$

E.g., how do education & job tenure interact with regard to predicted wage?

. gen educXtenure=educ*tenure

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$$

. reg wage educ tenure educXtenure

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  3,   522) =   82.15
       Model |  2296.41715     3  765.472384           Prob > F      =  0.0000
    Residual |  4863.99714   522  9.31800218           R-squared     =  0.3207
-------------+------------------------------           Adj R-squared =  0.3168
       Total |  7160.41429   525  13.6388844           Root MSE      =  3.0525

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .4265947   .0610327     6.99   0.000     .3066948    .5464946
      tenure |  -.0822392   .0737709    -1.11   0.265    -.2271635    .0626851
 educXtenure |   .0225057   .0059134     3.81   0.000     .0108887    .0341228
       _cons |  -.4612881   .7832198    -0.59   0.556    -1.999938    1.077362
------------------------------------------------------------------------------

Hypothesis test

$$H_0: \beta_3 = 0$$

$$H_a: \beta_3 \neq 0$$

where $\beta_3$ is the coefficient on the interaction term $x_1 x_2$.

Let’s interpret the model.

If the interaction term tests significant, we don’t interpret its base variables.

Why not? Each base variable represents its y/x slope when the other x=0. We don’t care about this.
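The point about base variables follows directly from the interaction model. Here is a short worked derivation, using the model's own symbols, of the slope of y on each x:

```latex
\frac{\partial E(y)}{\partial x_1} = \beta_1 + \beta_3 x_2 ,
\qquad
\frac{\partial E(y)}{\partial x_2} = \beta_2 + \beta_3 x_1
```

Setting $x_2 = 0$ in the first expression leaves only $\beta_1$, & setting $x_1 = 0$ in the second leaves only $\beta_2$. At any other level of the partner variable, the slope includes the $\beta_3$ term.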

To interpret the interaction term, we use ‘su x1, d’ & our knowledge of the subject to identify key levels of educ & tenure (or use one SD above mean, mean, & one SD below mean):

Then we predict the slope-effect of educXtenure on wage at the specified levels, as follows:

How the interaction of mean education with varying levels of tenure relates to average hourly wage:

. lincom educ + (educXtenure*2)

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.0372278    .062391    -0.60   0.551    -.1597961    .0853406

. lincom educ + (educXtenure*10)

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    .142818   .0221835     6.44   0.000     .0992382    .1863978

. lincom educ + (educXtenure*18)

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .3678751   .0503567     7.31   0.000     .2689484    .4668019

How the interaction of mean tenure with varying levels of education relates to average hourly wage:

. lincom tenure + (8*educXtenure)

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .0978065   .0303746     3.22   0.001     .0381351     .157478

. lincom tenure + (12*educXtenure)

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .1878294   .0184755    10.17   0.000     .1515338     .224125

. lincom tenure + (20*educXtenure)

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .3678751   .0503567     7.31   0.000     .2689484    .4668019
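The point estimates from `lincom tenure + (k*educXtenure)` are simple arithmetic on the interaction model's coefficients. The Python sketch below reproduces them; the function name is hypothetical, the coefficients are copied from the `reg wage educ tenure educXtenure` output, & only point estimates are computed (standard errors would require the coefficient covariance matrix).

```python
# Coefficients copied from the interaction model's output.
B_TENURE = -0.0822392
B_EDUC_X_TENURE = 0.0225057

def tenure_slope(educ):
    """Slope of wage on tenure at a given level of education."""
    return B_TENURE + B_EDUC_X_TENURE * educ

for years in (8, 12, 20):
    print(years, round(tenure_slope(years), 4))

# Education level at which the estimated tenure slope crosses zero.
ZERO_SLOPE_EDUC = -B_TENURE / B_EDUC_X_TENURE
print("tenure slope is zero at educ =", round(ZERO_SLOPE_EDUC, 2))
```

The estimated payoff to an additional year of tenure rises steadily with education, which is what unequal (non-parallel) regression slopes mean in this model.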

With significant interaction, to repeat, both the regression coefficient & the y-intercept change as the levels of the second interacting variable change.

That is, the regression slopes are unequal. What does this mean in the model for average hourly wage?

Our interaction model yielded an Adj R2 of .317.

Given the non-linearity we’ve uncovered, could we increase the explanatory power by combining quadratic & interaction terms?

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + e$$

. reg wage educ tenure educXtenure educ2 tenure2

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  5,   520) =   59.98
       Model |  2619.24058     5  523.848116           Prob > F      =  0.0000
    Residual |  4541.17371   520  8.73302637           R-squared     =  0.3658
-------------+------------------------------           Adj R-squared =  0.3597
       Total |  7160.41429   525  13.6388844           Root MSE      =  2.9552

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.7069382   .2269283    -3.12   0.002    -1.152747   -.2611293
      tenure |  -.0072781   .0848573    -0.09   0.932    -.1739834    .1594272
 educXtenure |   .0263957    .005787     4.56   0.000     .0150269    .0377645
       educ2 |   .0470478   .0091265     5.16   0.000     .0291184    .0649771
     tenure2 |  -.0050847   .0016688    -3.05   0.002    -.0083632   -.0018062
       _cons |   5.763382   1.464726     3.93   0.000     2.885874    8.640889
------------------------------------------------------------------------------

Let’s assess the model’s fit.

Let’s conduct a test of nested models, comparing this new, ‘full’ model to each of the previous, ‘reduced’ models.

Did adding educXtenure, educ2 & tenure2 boost the model’s variance-explaining power by a statistically significant margin?

. test educXtenure educ2 tenure2

 ( 1)  educXtenure = 0
 ( 2)  educ2 = 0
 ( 3)  tenure2 = 0

       F(  3,   520) =   17.47
            Prob > F =    0.0000

Did adding educ2 & tenure2 boost the model’s variance-explaining power by a statistically significant margin over the interaction model?

. test educ2 tenure2

 ( 1)  educ2 = 0
 ( 2)  tenure2 = 0

       F(  2,   520) =   18.48
            Prob > F =    0.0000
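The F statistic for comparing the full model with the interaction model can also be computed directly from the two residual sums of squares. This Python sketch uses the standard nested-model F formula; the function name is hypothetical, the inputs are copied from the two regression outputs, & the p-value is left out because it would need an F-distribution routine.

```python
def nested_f(ssr_reduced, ssr_full, n_restrictions, df_resid_full):
    """F statistic for testing a reduced model against the full model
    that nests it: the per-restriction drop in residual SS, divided by
    the full model's residual mean square."""
    numerator = (ssr_reduced - ssr_full) / n_restrictions
    denominator = ssr_full / df_resid_full
    return numerator / denominator

# Reduced: wage on educ, tenure, educXtenure (residual SS 4863.99714).
# Full: adds educ2 & tenure2 (residual SS 4541.17371, 520 residual df).
F_STAT = nested_f(4863.99714, 4541.17371, 2, 520)
print(round(F_STAT, 2))
```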

Valid testing of nested models

To conduct a valid test of nested models:

the number of observations for both the complete & reduced models must be equal;

the functional form of y must be the same (e.g., we can’t compare outcome variable ‘wage’ to outcome variable ‘log-wage’).

Comparing non-nested models

How do we compare non-nested models (i.e. models in which neither is a restricted version of the other, such as models with the same number of explanatory variables but different ones), or nested models that don’t meet the criteria for comparative testing?

Use either the AIC or BIC information criterion: the smaller the score, the better the model balances fit against complexity.

Download the ‘fitstat’ command (see Long/Freese, Regression Models for Categorical Dependent Variables).
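As an illustration of an information-criterion comparison, the Python sketch below computes AIC & BIC for two of the earlier wage models that are not nested in each other: the quadratic model (educ, educ2) & the interaction model (educ, tenure, educXtenure). It assumes the usual Gaussian log-likelihood form for a linear regression & counts the constant as a parameter; whether a given package reports AIC or BIC on exactly this scale is not assumed here, & the function name is hypothetical.

```python
import math

def gaussian_ic(ssr, n_obs, n_params):
    """Return (AIC, BIC) for a linear regression, from its residual SS.
    n_params counts every estimated coefficient, including the constant."""
    loglik = -0.5 * n_obs * (math.log(2 * math.pi) + math.log(ssr / n_obs) + 1)
    aic = -2 * loglik + 2 * n_params
    bic = -2 * loglik + n_params * math.log(n_obs)
    return aic, bic

# Residual sums of squares & n copied from the two regression outputs.
QUADRATIC = gaussian_ic(5721.00827, 526, 3)     # educ, educ2, _cons
INTERACTION = gaussian_ic(4863.99714, 526, 4)   # educ, tenure, educXtenure, _cons

print("quadratic   AIC, BIC:", [round(v, 1) for v in QUADRATIC])
print("interaction AIC, BIC:", [round(v, 1) for v in INTERACTION])
```

Both models have the same outcome (wage) & the same 526 observations, so their scores are comparable; the interaction model has the smaller AIC & BIC despite its extra parameter.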

. reg science read write math female

. fitstat, saving(model1) bic

. reg science read write

. fitstat, using(model1) bic

The output tells whether or not the ‘current’ model is supported &, if it is supported, to what degree.

And we can display AIC &/or BIC in ‘estimates table.’

. reg science read write math

. estimates store model1

. estimates table model1, stats(N df_m r2_a aic bic)

‘ereturn list’ provides the codes.

For BIC or AIC, “The upshot is that ex post, neither model is discarded; we have merely revised our assessment of the comparative likelihood of the two in the face of the sample data” (Greene, Econometric Analysis, page 153).

That is, the Bayesian approach compares “the two hypotheses rather than testing for the validity of one over the other” (Greene, Econometric Analysis, page 153).

Graphing the model

Graphing multiple variables from regression models requires 3-D graphing capabilities (see, e.g., Systat, SAS).

Here’s Stata’s crude version:

[Figure: 3-D plot of wage against educ & educ2]

. gr3 wage educ educ2

[Figure: 3-D plot of wage against tenure & tenure2]

. gr3 wage tenure tenure2

[Figure: quadratic fitted values of wage with 95% CI; x-axis: years of education]

. twoway qfitci wage educ, bc(yellow)

. twoway qfitci wage tenure, bc(red)

[Figure: quadratic fitted values of wage with 95% CI; x-axis: years with current employer]

What do the graphs tell us about the relationship of wage to years of education & to years of job tenure?

Using lincom to predict the slope for wage at specific values of the interaction variables is important, too.

Adj R2 was .360. Can we improve the model by adding dummy variables?

Let’s explore the possibility for females versus males; nonwhites versus whites; & urban (smsa) versus rural.

But first, a quick detour. We find evidence that 52% of the population of interest is female & 16% is nonwhite: are the sample percentages significantly different than these population benchmarks?

How do we statistically assess these possibilities?

. ci female nonwhite, binomial

                                                         -- Binomial Exact --
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      female |        526    .4790875     .021782        .4356663    .5227448
    nonwhite |        526    .1026616    .0132339        .0780672    .1318238

We’ll first check the confidence intervals.

Next we’ll try prtest to test the proportions.

. prtest female=.52

One-sample test of proportion                 female: Number of obs =      526

    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .4790875    .021782                      .4363956    .5217793

Ho: proportion(female) = .52

   Ha: female < .52          Ha: female != .52          Ha: female > .52
      z = -1.878                 z = -1.878                 z = -1.878
  P < z = 0.0302           P > |z| = 0.0604             P > z = 0.9698

. prtest nonwhite=.16

One-sample test of proportion               nonwhite: Number of obs =      526

    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    nonwhite |   .1026616   .0132339                      .0767235    .1285996

Ho: proportion(nonwhite) = .16

  Ha: nonwhite < .16        Ha: nonwhite != .16        Ha: nonwhite > .16
      z = -3.587                 z = -3.587                 z = -3.587
  P < z = 0.0002           P > |z| = 0.0003             P > z = 0.9998
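The z statistics in the prtest output can be reproduced by hand. The Python sketch below uses the standard one-sample test of a proportion, with the standard error computed under the null value. The function name is hypothetical. The count of 252 females comes from the sample; the count of 54 nonwhites is inferred here from the reported mean (.1026616 × 526) & should be treated as an assumption.

```python
import math

def one_sample_prop_z(successes, n_obs, p_null):
    """z statistic & two-sided p-value for Ho: proportion = p_null."""
    p_hat = successes / n_obs
    se_null = math.sqrt(p_null * (1 - p_null) / n_obs)
    z = (p_hat - p_null) / se_null
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

Z_FEMALE, P_FEMALE = one_sample_prop_z(252, 526, 0.52)
Z_NONWHITE, P_NONWHITE = one_sample_prop_z(54, 526, 0.16)

print("female:  ", round(Z_FEMALE, 3), round(P_FEMALE, 4))
print("nonwhite:", round(Z_NONWHITE, 3), round(P_NONWHITE, 4))
```

At the 5% level the two-sided test does not reject the 52% benchmark for females, but it does reject the 16% benchmark for nonwhites. This agrees with the exact binomial confidence intervals, which contain .52 but not .16.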

Let’s get back to our regression model, first by exploring the proposed new variables.

[Figure: means of wage (average hourly earnings) by female]

. grmeanby female, su(wage)

[Figure: means of wage (average hourly earnings) by nonwhite]

. grmeanby nonwhite, su(wage)

[Figure: means of wage (average hourly earnings) by smsa]

. grmeanby smsa, su(wage)

. tab female, su(wage)

      =1 if |  Summary of average hourly earnings
     female |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |         7.1         4.2         274
          1 |         4.6         2.5         252
------------+------------------------------------
      Total |         5.9         3.7         526

The sample’s average wage disparities are pronounced for females versus males, & notable but less pronounced for nonwhites versus whites & for urban (smsa) versus rural.

We should examine the wage distribution for each of these categorical, binary variables. Let’s illustrate this for females versus males:

. bys female: su wage

-> female = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        wage |       274    7.099489    4.160858        1.5      24.98

-> female = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        wage |       252    4.587659    2.529363        .53      21.63

[Figure: box plots of average hourly earnings by female (0, 1) & for the total sample]

. gr box wage, over(female, total) marker(1, mlabel(id))

. table female, contents(mean wage med wage sd wage min wage max wage)

=1 if female | mean(wage)  med(wage)  sd(wage)  min(wage)  max(wage)
-------------+------------------------------------------------------
           0 |        7.1          6  4.160858        1.5         25
           1 |        4.6        3.8  2.529363        .53         22

. ttest wage, by(female) unequal

Two-sample t test with unequal variances

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     274    7.099489    .2513666    4.160858    6.604626    7.594352
       1 |     252    4.587659    .1593349    2.529363    4.273855    4.901462
---------+--------------------------------------------------------------------
combined |     526    5.896103    .1610262    3.693086    5.579768    6.212437
---------+--------------------------------------------------------------------
    diff |             2.51183    .2976118                1.926971     3.09669

Satterthwaite's degrees of freedom: 456.327

Ho: mean(0) - mean(1) = diff = 0

    Ha: diff < 0              Ha: diff != 0              Ha: diff > 0
      t = 8.4400                 t = 8.4400                t = 8.4400
  P < t = 1.0000           P > |t| = 0.0000            P > t = 0.0000
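The unequal-variance t statistic & Satterthwaite degrees of freedom can be recovered from the group summaries alone. This Python sketch applies the standard Welch formulas; the function name is hypothetical & the inputs are the group means, standard deviations & counts reported for males (0) & females (1).

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with unequal variances, plus the
    Satterthwaite approximation to its degrees of freedom."""
    v1 = sd1 ** 2 / n1   # squared standard error of group 1's mean
    v2 = sd2 ** 2 / n2   # squared standard error of group 2's mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

T_STAT, DF_SATTERTHWAITE = welch_t(7.099489, 4.160858, 274,
                                   4.587659, 2.529363, 252)
print(round(T_STAT, 2), round(DF_SATTERTHWAITE, 3))
```

The unequal-variance version is the natural choice here because the male standard deviation (about 4.2) is much larger than the female one (about 2.5).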

[Figure: lowess smoother of average hourly earnings on years of education, bandwidth = .2, with each point labeled by female (0/1)]

. lowess wage educ, bwidth(.2) ml(female)

[Figure: scatterplots of average hourly earnings against years of education with quadratic fitted values, by female (0, 1)]

. scatter wage educ || qfit wage educ, by(female)

Page 77: III. Model Building. Model building: writing a model that will provide a good fit to a set of data & that will give good estimates of the mean value of

[Figure: average hourly earnings against years of education with median bands, in separate panels for female = 0 and female = 1.]

. scatter wage educ || mband wage educ, by(female)


. reg wage educ tenure educXtenure educ2 tenure2 female nonwhite smsa

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  8,   517) =   47.81
       Model |   3044.645     8  380.580624           Prob > F      =  0.0000
    Residual |  4115.7693   517  7.96086904           R-squared     =  0.4252
-------------+------------------------------           Adj R-squared =  0.4163
       Total | 7160.41429   525  13.6388844           Root MSE      =  2.8215


        wage |      Coef.  Std. Err.        t   P>|t|      [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        educ |  -.6354087   .2180251    -2.91   0.004     -1.063733   -.2070846
      tenure |  -.0437508   .0815164    -0.54   0.592     -.2038949    .1163932
 educXtenure |   .0274386   .0055418     4.95   0.000      .0165515    .0383258
       educ2 |   .0410706   .0087769     4.68   0.000      .0238277    .0583134
     tenure2 |  -.0050909   .0015958    -3.19   0.002     -.0082259   -.0019559
      female |  -1.727369    .254498    -6.79   0.000     -2.227347   -1.227392
    nonwhite |  -.1551834   .4090206    -0.38   0.705     -.9587302    .6483634
        smsa |    .834631   .2822579     2.96   0.003      .2801175    1.389145
       _cons |   6.215331   1.405937     4.42   0.000       3.45328    8.977382
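A minimal sketch of how the constructed terms in this model could be generated. It assumes the wage data are already in memory and that educXtenure, educ2 & tenure2 are the product and squares their names suggest (the slides do not show this step). The second regression uses factor-variable notation, available in Stata 11 or later, as one alternative way to request the same terms.

```stata
* Assumption: the wage data set from these slides is in memory and the
* three constructed variables do not exist yet (gen fails if they do).
gen educXtenure = educ*tenure     // education-by-tenure interaction
gen educ2       = educ^2          // squared education
gen tenure2     = tenure^2        // squared tenure

reg wage educ tenure educXtenure educ2 tenure2 female nonwhite smsa

* Same terms requested with factor-variable notation (Stata 11+)
reg wage c.educ##c.tenure c.educ#c.educ c.tenure#c.tenure female nonwhite smsa
```

Both regressions should give the same fit; only the coefficient labels differ.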


Let’s re-estimate the model, without nonwhite.

. reg wage educ tenure educXtenure educ2 tenure2 female smsa

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  7,   518) =   54.71
       Model | 3043.49906     7   434.78558           Prob > F      =  0.0000
    Residual | 4116.91523   518   7.9477128           R-squared     =  0.4250
-------------+------------------------------           Adj R-squared =  0.4173
       Total | 7160.41429   525  13.6388844           Root MSE      =  2.8192


        wage |      Coef.  Std. Err.        t   P>|t|      [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        educ |  -.6313552   .2175832    -2.90   0.004     -1.058809   -.2039013
      tenure |  -.0463398   .0811631    -0.57   0.568     -.2057891    .1131095
 educXtenure |   .0275878   .0055232     4.99   0.000      .0167372    .0384385
       educ2 |   .0409219   .0087609     4.67   0.000      .0237105    .0581332
     tenure2 |  -.0050603   .0015924    -3.18   0.002     -.0081887   -.0019319
      female |  -1.726572    .254279    -6.79   0.000     -2.226116   -1.227027
        smsa |   .8338924   .2820179     2.96   0.003       .279853    1.387932
       _cons |    6.17466   1.400685     4.41   0.000      3.422939    8.926381

There are no notable changes in the coefficients. Let’s conduct a nested model test:
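A sketch of one way to run this nested model test, assuming the full model (with nonwhite) is re-estimated first so that it is the active estimation result. With a single restriction the F statistic is just the square of nonwhite's t statistic, and the residual sums of squares reported above give the same figure: (4116.91523 - 4115.7693)/7.96086904 ≈ 0.144, with the same p-value as the t test (0.705).

```stata
* Full model, then a Wald F test that the coefficient on nonwhite is zero
reg wage educ tenure educXtenure educ2 tenure2 female nonwhite smsa
test nonwhite
```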


Let’s say we find evidence that the regression slope for wage on female versus male may not be the same in urban versus rural areas.

That is, there may be a statistically significant femaleXsmsa interaction.

Let’s find out:


Normally, we create the interaction variable femaleXsmsa & write the model as follows:

. reg wage educ tenure educXtenure educ2 tenure2 female smsa femaleXsmsa

Stata, however, lets us do the work on the fly:

. xi:reg wage educ tenure educXtenure educ2 tenure2 i.female*i.smsa


. xi:reg wage educ tenure educXtenure educ2 tenure2 i.female*i.smsa

i.female          _Ifemale_0-1        (naturally coded; _Ifemale_0 omitted)
i.smsa            _Ismsa_0-1          (naturally coded; _Ismsa_0 omitted)
i.fem~e*i.smsa    _IfemXsms_#_#       (coded as above)

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  8,   517) =   48.19
       Model | 3058.50881     8  382.313602           Prob > F      =  0.0000
    Residual | 4101.90548   517  7.93405315           R-squared     =  0.4271
-------------+------------------------------           Adj R-squared =  0.4183
       Total | 7160.41429   525  13.6388844           Root MSE      =  2.8167


        wage |      Coef.  Std. Err.        t   P>|t|      [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        educ |  -.6125454   .2178258    -2.81   0.005     -1.040478   -.1846129
      tenure |   -.049629   .0811286    -0.61   0.541     -.2090112    .1097533
 educXtenure |   .0278069   .0055208     5.04   0.000       .016961    .0386528
       educ2 |   .0399927   .0087794     4.56   0.000       .022745    .0572405
     tenure2 |  -.0050664   .0015911    -3.18   0.002     -.0081922   -.0019407
  _Ifemale_1 |   -1.18429   .4690306    -2.52   0.012      -2.10573     -.26285
    _Ismsa_1 |   1.193317   .3842972     3.11   0.002      .4383412    1.948293
_IfemXsms_~1 |  -.7594022   .5521189    -1.38   0.170     -1.844075    .3252702
       _cons |   5.841761   1.420256     4.11   0.000      3.051579    8.631943

We fail to reject the null hypothesis for femaleXsmsa.
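For readers on Stata 11 or later, the same interaction can be requested without the xi prefix. This is a sketch using factor-variable notation, not the approach shown in the slides; it assumes the same data and constructed variables are in memory.

```stata
* i.female##i.smsa expands to both main effects plus their interaction
reg wage educ tenure educXtenure educ2 tenure2 i.female##i.smsa

* Wald test of the interaction term alone
testparm i.female#i.smsa
```

The estimates should match the xi output above; only the coefficient labels differ.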


Next we hypothesize that average hourly wage varies by economic sector.

So let’s add to the model a series of dummy variables for the economic sectors, the comparison sector being manufacturing.

We hypothesize that the regression slope is the same for each sector but the y-intercept varies:


. reg wage educ tenure educXtenure educ2 tenure2 female smsa construc ndurman trcommpu trade services profserv

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F( 13,   512) =   33.38
       Model | 3284.61293    13  252.662533           Prob > F      =  0.0000
    Residual | 3875.80136   512  7.56992453           R-squared     =  0.4587
-------------+------------------------------           Adj R-squared =  0.4450
       Total | 7160.41429   525  13.6388844           Root MSE      =  2.7513

Economic sectors are compared to manufacturing.


        wage |      Coef.  Std. Err.        t   P>|t|      [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        educ |  -.6367266   .2137267    -2.98   0.003     -1.056616   -.2168374
      tenure |  -.0637539   .0796214    -0.80   0.424     -.2201786    .0926709
 educXtenure |   .0266911   .0054182     4.93   0.000      .0160463    .0373358
       educ2 |   .0413504   .0086545     4.78   0.000      .0243477    .0583532
     tenure2 |  -.0044319   .0015651    -2.83   0.005     -.0075067    -.001357
      female |  -1.682962   .2608887    -6.45   0.000     -2.195505   -1.170418
        smsa |     .84172   .2768325     3.04   0.002      .2978526    1.385587
    construc |  -.8289172    .643652    -1.29   0.198     -2.093441    .4356068
     ndurman |   -1.13003   .4770815    -2.37   0.018     -2.067308   -.1927519
    trcommpu |   -1.60342   .6639094    -2.42   0.016     -2.907742   -.2990986
       trade |  -1.910428   .3891094    -4.91   0.000     -2.674875    -1.14598
    services |  -1.928182    .503255    -3.83   0.000     -2.916881   -.9394828
    profserv |  -.7975599   .4156892    -1.92   0.056     -1.614226    .0191064
       _cons |   7.426713   1.399124     5.31   0.000      4.677982    10.17544


We test not the individual significance but the joint significance of the dummy variable series:

. testparm construc-profserv

 ( 1)  construc = 0
 ( 2)  ndurman = 0
 ( 3)  trcommpu = 0
 ( 4)  trade = 0
 ( 5)  services = 0
 ( 6)  profserv = 0

       F(  6,   512) =    5.31
            Prob > F =    0.0000

Testing the nested model.
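As a check on what this test is doing, the F statistic can be reproduced by hand from the two regression outputs above. Here RSS_r is the residual sum of squares of the restricted model (without the sector dummies), RSS_f that of the full model, q the number of restrictions, and n - k - 1 the full model's residual degrees of freedom; the notation is ours, not the slides'.

```latex
F = \frac{(RSS_{r} - RSS_{f})/q}{RSS_{f}/(n-k-1)}
  = \frac{(4116.91523 - 3875.80136)/6}{3875.80136/512}
  = \frac{40.1856}{7.5699}
  \approx 5.31
```

This agrees with the F(6, 512) = 5.31 reported by testparm.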


‘testparm’ (test parameters) allows us to enter the first dummy variable in the series, a dash, & the last dummy variable in the series.

‘test’ requires that each dummy variable in the series be entered.
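A sketch of the equivalent 'test' call, assuming the sector-dummy model above is the most recent estimation. It should report the same F(6, 512) as testparm. Note that the dash range in testparm follows the order of the variables in the data set, so the range form is only safe when the series is stored contiguously.

```stata
* Joint test of the six sector dummies, naming each one explicitly
test construc ndurman trcommpu trade services profserv
```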


The model, then, has greatly improved: Adj R2 has reached .445, & the other, more important fit indicators look fine.

But are the slope coefficients for wage really the same for females & males?

Let’s test the assumption of equal slopes.


We have to estimate a new model, this time interacting the dummy variable female with each of the other explanatory variables.

Why choose ‘female’ as the variable to interact with the others?


We could create an interaction variable corresponding to female’s interaction with each explanatory variable: femaleXeduc, femaleXtenure, femaleXprofserv, etc.

Better: create an interaction variable only for those main-effect variables that for good reason you think should be expected to vary by gender.

Either way, Stata again allows us to do the work on the fly—but be sure you know how formally to write such a model.
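To make the formal version concrete, here is a sketch of such a model for an illustrative subset of the variables (education and tenure only; the choice of subset and the coefficient labels are ours).

```latex
\text{wage} = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{tenure} + \beta_3\,\text{female}
  + \beta_4\,(\text{female}\times\text{educ})
  + \beta_5\,(\text{female}\times\text{tenure}) + \varepsilon
```

For males (female = 0) the slope on educ is \beta_1; for females it is \beta_1 + \beta_4, and likewise \beta_2 versus \beta_2 + \beta_5 for tenure. The equal-slopes null hypothesis in this reduced example is \beta_4 = \beta_5 = 0.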


Note: we could have used this approach to create educXtenure, but for pedagogical reasons we created this variable the formal way.

xi:reg wage i.female*educ i.female*tenure i.female*educXtenure i.female*educ2 i.female*tenure2 i.female*smsa i.female*construc i.female*ndurman i.female*trcommpu i.female*trade i.female*services i.female*profserv

For the sake of it, we’ll do a ‘full interaction’ model.


. testparm _Ifemale_1- _IfemXprofs_1

 ( 1)  _Ifemale_1 = 0
 ( 2)  _IfemXeduc_1 = 0
 ( 3)  _IfemXtenur_1 = 0
 ( 4)  _IfemXeducX_1 = 0
 ( 5)  _IfemXeduc2_1 = 0
 ( 6)  _IfemXtenura1 = 0
 ( 7)  _IfemXsmsa_1 = 0
 ( 8)  _IfemXconst_1 = 0
 ( 9)  _IfemXndurm_1 = 0
 (10)  _IfemXtrcom_1 = 0
 (11)  _IfemXtrade_1 = 0
 (12)  _IfemXservi_1 = 0
 (13)  _IfemXprofs_1 = 0
       Constraint 1 dropped

       F( 12,   500) =    2.22
            Prob > F =    0.0099

Let’s conduct a nested model test: from ‘female’ to the last ‘femX…’ interaction.


Our conclusion?

We reject the null hypothesis: there is statistically significant evidence that the slopes for average hourly wage differ between females & males.

Substantive meaning?


Key lesson: we should test the baseline notions of linearity & uniform slopes, & when necessary revise the model accordingly.

But we don’t have to do a ‘full interaction’ model.

Don’t take linearity & uniform slopes for granted


Note: The econometric literature discusses the detection of significantly different slopes in terms of the Chow Test. Stata’s joint ‘test’ procedure is equivalent to the Chow Test (type ‘findit Chow Test’, which will lead you to Stata FAQs on the subject).

See Wooldridge, Introductory Econometrics, pp. 237-240; & Stata’s online FAQs.


A colleague of yours inspects your statistical work & says “Nice try, but you goofed with regard to the outcome variable.”

Where did we go wrong? Let’s take a look & see.


[Figure: histogram of average hourly earnings (density) with a normal curve overlaid.]

. histogram wage, norm plotr(c(navy))


Average hourly wage is highly right skewed.

How should we address this problem?

Let’s begin by using some helpful Stata tools—qladder & ladder.


For qladder & ladder, the null hypothesis is that each displayed, transformed distribution is normal.

The alternative hypothesis is that it isn’t normal.

So, in ladder & qladder, we want to fail to reject the null hypothesis to obtain an effective normalizing transformation.


[Figure: quantile-normal plots of average hourly earnings by transformation. Panels: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic.]

. qladder wage


Transformation         formula            chi2(2)     P(chi2)
-------------------------------------------------------------
cubic                  wage^3                   .       0.000
square                 wage^2                   .       0.000
raw                    wage                     .       0.000
square-root            sqrt(wage)               .       0.000
log                    log(wage)            13.99       0.001
reciprocal root        1/sqrt(wage)             .       0.000
reciprocal             1/wage                   .       0.000
reciprocal square      1/(wage^2)               .       0.000
reciprocal cubic       1/(wage^3)               .       0.000

. ladder wage

qladder’s ‘log’ looked helpful, but nothing looks helpful (i.e. insignificant) in ladder. Let’s explore:


[Figure: sparl plot of average hourly earnings against years of education with the fitted line.]

wage = -0.905 + 0.541 educ    r² = 0.165    RMSE = 3.378    n = 526

. sparl wage educ


[Figure: sparl plot of average hourly earnings against years of education, with log-transformed y.]

log wage = 0.584 + 0.083 educ    r² = 0.186    RMSE = 0.480    n = 526

. sparl wage educ, logy


We’ll opt for a log transformation, which is the most basic way of pulling in the long right tail of a highly right-skewed distribution:

. gen lwage=ln(wage)

. su wage lwage

Note: in Stata, log(wage) & ln(wage) are equivalent (both return the natural log).
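A quick way to check whether the transformation helped, sketched under the assumption that lwage has been generated as above. ladder is built on the skewness-kurtosis test, which can also be run directly; qnorm gives the matching quantile-normal plot.

```stata
* Skewness-kurtosis test of normality for the raw and logged outcome
* (H0: the variable is normally distributed)
sktest wage lwage

* Quantile-normal plot of the logged outcome
qnorm lwage
```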


[Figure: histogram of average hourly earnings (density) with a normal curve overlaid.]

. histogram wage, norm plotr(c(navy))


. histogram lwage, norm plotr(c(navy))

[Figure: histogram of log(wage) (density) with a normal curve overlaid.]


Recall that log transformations require quantitative, ratio variables with positive values—plus not ‘too many’ zero values (see Wooldridge) & ideally a ratio between lowest & highest values of at least 10.

Are the results of the log transformation satisfactory? Why, or why not?


How do the models fit (again recalling our discussion of comparing non-nested models)? Interpretation?

Quantitative explanatory variables: every one-unit increase in x multiplies average hourly wage by …, on average, holding the other variables constant.

Categorical explanatory variables: e.g., having a job in services multiplies average hourly wage by …, on average, holding the other variables constant.
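The multiplier in these statements is the exponentiated coefficient. As a worked illustration we use the bivariate sparl fit shown earlier (log wage = 0.584 + 0.083 educ), not the full model, whose log-wage estimates are not displayed in these slides.

```latex
\widehat{\log(\text{wage})} = 0.584 + 0.083\,\text{educ}
\quad\Longrightarrow\quad
\text{multiplier per additional year of education} = e^{0.083} \approx 1.087
```

That is, each additional year of education multiplies average hourly wage by about 1.087 (roughly an 8.7% increase) in that bivariate fit.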


Now we need to use lincom to predict wage, or the direction & magnitude of its slope, at specific levels of key explanatory variables (or one SD above mean, mean, & one SD below mean).

We’ll leave that for you to do.
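As a starting point, here is a sketch of the lincom syntax using the earlier wage model; the same pattern applies with lwage as the outcome. The values educ = 12 and tenure = 5 are illustrative choices of ours, not values from the slides.

```stata
reg wage educ tenure educXtenure educ2 tenure2 female smsa

* Slope of wage on educ at educ = 12, tenure = 5:
*   d(wage)/d(educ) = b_educ + 2*b_educ2*educ + b_educXtenure*tenure
lincom educ + 24*educ2 + 5*educXtenure

* Predicted wage at educ = 12, tenure = 5 for a male outside an SMSA
* (educXtenure = 60, educ2 = 144, tenure2 = 25; female = 0, smsa = 0)
lincom _cons + 12*educ + 5*tenure + 60*educXtenure + 144*educ2 + 25*tenure2
```

lincom reports the estimate together with its standard error and confidence interval, which is the main advantage over computing the combination by hand.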


Remember that:

(1) log transformations require quantitative, ratio variables with positive values, plus not ‘too many’ zero values & ideally a ratio of at least 10 between the lowest & highest values;

(2) quadratic (& similar) transformations require quantitative, interval or ratio variables, & not necessarily positive values; &

Some summary points


(3) in any instance, matters of theory, interpretability, & common sense may lead us not to transform a variable, even though doing so may make sense on purely statistical grounds.


Furthermore, the assumptions of linearity & uniform slopes must be tested.

Compare nested models, using AIC or BIC when the two models’ numbers of observations are unequal or when their numbers of explanatory variables are equal.
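A minimal sketch of pulling AIC & BIC for two of the models estimated above, assuming the same data and constructed variables are in memory. The stored-result names base and sectors are arbitrary labels of ours.

```stata
reg wage educ tenure educXtenure educ2 tenure2 female smsa
estimates store base

reg wage educ tenure educXtenure educ2 tenure2 female smsa ///
    construc ndurman trcommpu trade services profserv
estimates store sectors

* One table with N, log likelihood, AIC & BIC for both models
estimates stats base sectors
```

Lower AIC or BIC indicates the preferred model; after a single regression, estat ic gives the same statistics for that model alone.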

Don’t make predictions beyond the range of the model’s x-values.


Don’t overfit a model to a data sample: most samples have their quirks, & overfitting a model to such quirks comes at the expense of the model’s generality.

Thus, don’t go overboard with transforming variables & with trying to boost R2.


And don’t forget median band regression (twoway mband y x, bands(#)) & lowess smoothing (lowess y x, bwidth(.#); see also locpoly): these are helpful tools at all stages of y/x data analysis.
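A sketch of both tools applied to the wage data used in these slides. The number of bands and the bandwidth below are illustrative values, to be tuned by eye.

```stata
* Scatterplot with median bands overlaid (8 bands is an arbitrary choice)
twoway (scatter wage educ) (mband wage educ, bands(8))

* Lowess smoother; smaller bwidth() values follow the data more closely
lowess wage educ, bwidth(.4)
```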


We’ll be doing more transformations as part of regression diagnostics (i.e. assessing & correcting violations of regression’s statistical assumptions & dealing with outliers that distort the results).


One final question: can we validly compare the magnitude of slope coefficients within a regression model?

Usually not, because their metrics are typically different (e.g., years of education, score on a mental health scale, & quantitative versus categorical variables in regard to average hourly wage).


Standardized regression coefficients

We can, however, validly compare the magnitude of the slope coefficients if we standardize them, which expresses each one as the standard-deviation change in y per one-standard-deviation change in x.

This is easy to do in Stata:

. reg y x1 x2 x3, beta


      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  3,   522) =   76.87
       Model |  2194.1116     3  731.370532           Prob > F      =  0.0000
    Residual | 4966.30269   522  9.51398984           R-squared     =  0.3064
-------------+------------------------------           Adj R-squared =     .30
       Total | 7160.41429   525  13.6388844

        wage |      Coef.  Std. Err.        t   P>|t|        Beta
-------------+---------------------------------------------------
        educ |   .5989651   .0512835    11.68   0.000    .4490953
       exper |   .0223395   .0120568     1.85   0.064    .0820981
      tenure |   .1692687   .0216446     7.82   0.000    .3311255
       _cons |  -2.872735   .7289643    -3.94   0.000           .


For every standard deviation increase in education, wage increases by .45 standard deviations on average, holding the other variables constant.

For every standard deviation increase in experience, wage increases by .08 standard deviations on average, holding the other variables constant.

For every standard deviation increase in tenure, wage increases by .33 standard deviations on average, holding the other variables constant.
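To see where the Beta column comes from, here is a sketch that standardizes the variables by hand and re-runs the regression. The z-prefixed variable names are ours, and the match with the Beta column assumes no missing values, so that the standardization sample equals the estimation sample.

```stata
* z-scores: (value - mean)/SD for the outcome and each predictor
egen zwage   = std(wage)
egen zeduc   = std(educ)
egen zexper  = std(exper)
egen ztenure = std(tenure)

* Slopes from this regression should reproduce the Beta column
reg zwage zeduc zexper ztenure
```

Equivalently, each standardized coefficient is the raw slope multiplied by SD(x)/SD(y).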


Standardizing regression coefficients can be quite useful & is commonly done, but it does have serious limitations:

The standardized values depend on the particular sample: comparisons can’t be made across samples.

The standardized values depend on which other variables are included in the equation: change one or more of the variables & the standardized values change.


Comparisons of standardized coefficients, then, can’t be made across regression equations.

Standardization makes no sense for interpreting categorical explanatory variables: there’s no standard deviation change in, e.g., gender, ethnicity, or religion, so don’t bother trying to interpret categorical variables when they’re included in a standardized regression model (but, rather, use standardization to gauge the relative effect of categorical as well as quantitative explanatory variables on the outcome variable).


And the interpretation of standardized interaction terms can be deceptive.


Here’s a convenient (downloadable) command to obtain the standardized coefficients.

After estimating the model:

. listcoef, std

See Long/Freese for details.


Combining graph curves in Stata

Sometimes it may be helpful to combine straight & curved graphs of twoway scatterplots.

Here are a couple of examples.


. scatter write math || lfit science math || fpfit science math

[Figure: writing score against math score, with linear fitted values and the fractional-polynomial prediction of science on math overlaid.]


[Figure: writing score against math score, with linear fitted values and median bands of science on math overlaid.]

. scatter write math || lfit science math || mband science math


Summary

What’s a theory? What does it involve?

What’s a model?

Interplay of theory & empirical research?

Approaches to model building? Fundamental principles of model building?