Lab 2 – Binary Choice in Cross Section and Panel Data
Source: people.stern.nyu.edu/wgreene/GCEP/Lab2-2015.pdf

Based on GSOEP Health Care Data. N = 7,293 households, T = 1,…,7 periods (unbalanced panel). Examine various specifications based on

health status_it = f(age, education, gender, marital status, public insurance, income) + ε_it.

The variable HSAT = health satisfaction is coded 0,1,…,10. We analyze health status = 1(HSAT > 6). For most of this exercise, we will examine the impact of income on health. We will look at several specifications and some aspects of nonlinear estimation. The script and data set for this exercise are in the files Lab2-2015.lim and healthcare.lpj. The beginning of the script defines the panel data set and transforms the 11-point scale variable HSAT to the binary outcome, healthy. We take a look at the data and define a specification for the model.
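The transformation of HSAT into the binary outcome can be sketched outside the lab script as follows. This is a minimal Python illustration, not the LIMDEP code in Lab2-2015.lim; the HSAT values and the helper name to_healthy are made up for the example.

```python
# Hypothetical HSAT responses on the 0-10 scale (illustrative values only).
hsat = [3, 7, 10, 6, 0, 8]

def to_healthy(score, cutoff=6):
    """Return 1 if the satisfaction score exceeds the cutoff, else 0."""
    if not 0 <= score <= 10:
        raise ValueError("HSAT must lie between 0 and 10")
    return 1 if score > cutoff else 0

# healthy = 1(HSAT > 6), as defined in the text.
healthy = [to_healthy(s) for s in hsat]
print(healthy)
```

Note that a score of exactly 6 maps to 0 because the inequality is strict.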

I. Cross Section Variation

A. Compare Functional Forms

The probit, logit, and “linear probability” models produce different coefficient estimates. The important question to consider is whether they produce different analyses and conclusions about the data. Examine the three sets of coefficients.
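Much of the difference in the coefficients is a matter of scale. The sketch below, in Python with a hypothetical probit coefficient, computes the logit and linear probability coefficients that would give the same partial effect when the index equals zero. It is only an illustration of the scaling, not a substitute for examining the estimated models.

```python
import math

def normal_pdf(z):
    """Standard normal density, the scale factor for a probit partial effect."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def logistic_pdf(z):
    """Logistic density L(z)(1 - L(z)), the scale factor for a logit partial effect."""
    p = 1.0 / (1.0 + math.exp(-z))
    return p * (1.0 - p)

beta_probit = 0.50   # hypothetical coefficient, for illustration only
z = 0.0              # evaluate at an index of zero (probability 0.5)

pe_probit = beta_probit * normal_pdf(z)
# Logit coefficient that produces the same partial effect at z.
beta_logit = pe_probit / logistic_pdf(z)
# In the linear probability model the coefficient is the partial effect.
beta_lpm = pe_probit

ratio = beta_logit / beta_probit
print(round(ratio, 4), round(beta_lpm, 4))
```

The ratio of about 1.6 is the familiar rule of thumb for comparing logit and probit coefficients. Away from an index of zero the match is only approximate.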

B. Odds Ratios

Researchers sometimes report odds ratios instead of coefficients with logit models. What are odds ratios? What story do they tell about the dependent variable? Examine a set of odds ratios next to a set of partial effects.
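As a point of reference for the comparison, the following Python sketch uses a hypothetical logit coefficient to show that the odds ratio for a one unit change is exp(β) at every starting point, while the change in the probability itself depends on where one starts. The coefficient and the base index values are assumptions chosen for illustration.

```python
import math

def logit_prob(index):
    """Logistic probability for a given index value."""
    return 1.0 / (1.0 + math.exp(-index))

def odds(p):
    """Odds in favor of the outcome."""
    return p / (1.0 - p)

beta = 0.4  # hypothetical coefficient on a variable that rises by one unit

ratios = []
changes = []
for base_index in (-1.0, 0.0, 1.5):
    p0 = logit_prob(base_index)
    p1 = logit_prob(base_index + beta)
    ratios.append(odds(p1) / odds(p0))  # the same at every base index
    changes.append(p1 - p0)             # differs across base indexes

print([round(r, 4) for r in ratios])
print([round(c, 4) for c in changes])
```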

C. The Effect of Changes in Income on Reported Health Status

Examine the distribution of income in the data. (The extreme skew is not substantive.) Fit a probit model that contains income on the RHS. (It is the last variable in X.) What is the coefficient? Is the “income effect” statistically significant? Does this result make sense? To examine how income relates to health status, simulate the probabilities. (The simulation sets income to the specified values for each observation and averages over all the other effects.) Because of the probability function, the effect is already nonlinear in income. We add a quadratic term in income to the model. What do you conclude about the shape of the income effect in the model? Now what does the simulation say about the relation between income and health status?
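The simulation described above can be sketched in Python as follows. The coefficients and the four data rows are hypothetical, and the function name average_probability is introduced only for this example; the lab itself uses the SIMULATE command on the estimated model.

```python
import math

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical observations (age, female); illustrative values only.
rows = [(25, 0), (40, 1), (55, 1), (63, 0)]

# Hypothetical probit coefficients, including a quadratic term in income.
b = {"const": 1.0, "age": -0.02, "female": -0.05, "inc": 0.6, "inc2": -0.15}

def average_probability(income):
    """Set income to one value for every observation, then average Prob(y=1)."""
    total = 0.0
    for age, female in rows:
        index = (b["const"] + b["age"] * age + b["female"] * female
                 + b["inc"] * income + b["inc2"] * income ** 2)
        total += normal_cdf(index)
    return total / len(rows)

# Income from 0 to 2 in steps of 0.2.
grid = [round(0.2 * k, 1) for k in range(11)]
profile = [(inc, average_probability(inc)) for inc in grid]
for inc, p in profile:
    print(inc, round(p, 4))
```

With these made-up coefficients the index peaks at an income of 2, so the simulated probability rises over the whole grid but at a decreasing rate.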

D. An Experiment: Scaling in the Binary Choice Model

Start from the model y* = β′x + e, y = 1(β′x + e > 0), Prob(y=1|x) = Φ(β′x). If e is heteroscedastic, with standard deviation s(e) = exp(-γ′z), then the resulting model will be Prob(y=1|x) = Φ(β′x × exp(γ′z)). This could be viewed as a scaling effect on the coefficients. For our example, suppose z = FEMALE. Then, the model has differently scaled coefficients for men and women. This exercise shows how to program a new model. (There is actually a built-in command for this model, but we’ll do it the more interesting way.) Fit the model, then, using SIMULATE, see the implication of the functional form for the different models for men and women. For an extension, add WORKING to z and determine if the gender effect persists.
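The probability and log likelihood for the scaled model can be written down directly. The Python sketch below is one way to do it, with hypothetical parameter values and three made-up observations; it evaluates the function but does not maximize it, and it is not the lab's LIMDEP program.

```python
import math

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def scaled_probit_prob(beta, x, gamma, z):
    """Prob(y=1|x,z) = Phi(beta'x * exp(gamma'z))."""
    index = sum(b * xi for b, xi in zip(beta, x))
    scale = math.exp(sum(g * zi for g, zi in zip(gamma, z)))
    return normal_cdf(index * scale)

def log_likelihood(beta, gamma, data):
    """Sum of log probabilities; data is a list of (y, x, z) triples."""
    ll = 0.0
    for y, x, z in data:
        p = scaled_probit_prob(beta, x, gamma, z)
        ll += math.log(p if y == 1 else 1.0 - p)
    return ll

# Hypothetical parameters; z holds FEMALE only.
beta = [0.2, 0.5]
gamma = [0.3]
x = [1.0, 1.0]

p_men = scaled_probit_prob(beta, x, gamma, [0.0])    # scale factor is 1
p_women = scaled_probit_prob(beta, x, gamma, [1.0])  # index scaled by exp(0.3)

# Made-up observations (y, x, z) used only to evaluate the function.
data = [(1, [1.0, 1.0], [0.0]), (0, [1.0, -0.5], [1.0]), (1, [1.0, 0.2], [1.0])]
ll = log_likelihood(beta, gamma, data)
print(round(p_men, 4), round(p_women, 4), round(ll, 4))
```

Setting gamma to zero returns the ordinary probit model, which is a convenient check on any implementation.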

E. Hypothesis Test

In the model in part C, the income effect is carried by two coefficients, a linear term and a quadratic one. In order to test the hypothesis that income is not related to health status, we test the hypothesis that these two coefficients equal zero.
(1) The Wald test is common, and requires a bit of matrix algebra. For simple hypotheses, the Wald test is often built into program command languages. The statistic has a limiting chi-squared distribution with 2 degrees of freedom. What is the result of the test?
(2) The likelihood ratio test is easy to carry out. It requires two estimations. What is the result of the test? Is the chi-squared statistic similar to the Wald statistic?
(3) The LM test is not commonly used except in a few specific cases. It is actually rather simple to construct as well. What is the result of the LM test of the hypothesis? Is the result similar to the other two tests?

F. Partial Effects

In nonlinear modeling, partial effects are more important than coefficients. Based on the probit model (the results would be essentially the same for logit), compute
(1) The partial effects for age for men and women separately – are they noticeably different?
(2) The partial effect of age as income varies from 0 to 2 in steps of .2. (This is most of the range of income.)
(3) The gender difference in the probability for health status.

G. Compare Logit to Probit

Recompute the partial effect in F.(3) for a logit model instead of a probit model. Are they similar?

H. Average Partial Effects vs. Partial Effects at the Mean

Are these two things really different? Compute both for the logit model in part G and compare.

I. Partial Effects vs. Elasticities

In computing partial effects, researchers sometimes, because the probabilities are between zero and one, interpret partial effects as percentage changes. In his paper on cheating in the Chicago school system, with B. Jacob, S. Levitt interpreted a coefficient of 0.057 in an LPM as an “effect of 5.7%.” In fact, only 0.9% of the observations in the sample equaled one, so the average probability in the sample was 0.009. The effect of a .057 change in the probability, from a base of 0.009, was more like 600%! (an outrageous result). It is necessary to compute and interpret partial effects and elasticities carefully. Compute the partial effects and elasticities for our health model based on the binary variable FEMALE and the continuous variable INCOME. Note, the effects are less dramatic here because our average probability of 0.60 is nearer to the center of the (0,1) range. What do you find in the comparisons?
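The arithmetic behind the 600% figure, and the contrast with a base probability of 0.60, can be checked with a few lines of Python. The function names are introduced here for illustration, and the income values in the elasticity line are hypothetical, not results from the health model.

```python
def proportional_change(partial_effect, base_probability):
    """Change in the probability relative to its starting level."""
    return partial_effect / base_probability

def elasticity(partial_effect, x, probability):
    """Elasticity of the probability with respect to a continuous x."""
    return partial_effect * x / probability

# Figures quoted in the text: an effect of 0.057 from a base of 0.009.
levitt = proportional_change(0.057, 0.009)

# The same sized effect measured against our average probability of 0.60.
ours = proportional_change(0.057, 0.60)

# Hypothetical income partial effect of 0.10 at income = 0.35.
income_elasticity = elasticity(0.10, 0.35, 0.60)

print(round(100 * levitt), round(100 * ours, 1), round(income_elasticity, 4))
```

A partial effect of 0.057 is a 5.7 percentage point change in both cases, but it is roughly a 633% change from a base of 0.009 and only a 9.5% change from a base of 0.60.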

J. Homogeneity Test

In examining microdata such as the GSOEP, researchers sometimes analyze data on women and men separately. It might be of interest to test whether the separation is supported by the data. The Chow test would be used for linear regressions. A likelihood ratio test is used for nonlinear models. Using the LR test, carry out the homogeneity test for our health data. What is the result of the test?
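The LR statistic for this test is twice the difference between the sum of the group log likelihoods and the pooled log likelihood. A minimal Python sketch follows, assuming the pooled and group models contain the same coefficients (so FEMALE is left out of all three). The log likelihood values and the coefficient count are hypothetical.

```python
def lr_homogeneity(loglik_pooled, logliks_by_group, n_coefficients):
    """Return the LR statistic and its degrees of freedom.

    The unrestricted log likelihood is the sum over the separately
    estimated groups; the restricted one is the pooled log likelihood.
    """
    unrestricted = sum(logliks_by_group)
    statistic = 2.0 * (unrestricted - loglik_pooled)
    df = (len(logliks_by_group) - 1) * n_coefficients
    return statistic, df

# Hypothetical log likelihoods: pooled, then men and women separately.
stat, df = lr_homogeneity(-1000.0, [-480.0, -505.0], n_coefficients=6)

critical_95 = 12.592  # 95% critical value of chi-squared with 6 d.f.
reject = stat > critical_95
print(stat, df, reject)
```

The critical value has to match the degrees of freedom, so it must be changed if the model has a different number of coefficients.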

K. Endogeneity and 2SLS

One might suspect that income is endogenous in our health status equation. One’s first instinct might be to find an instrument and use an IV estimator. The first might be straightforward enough. But, for the second, the binary choice model is nonlinear. In fact, a fully specified model is available. Stata calls this the “IV Probit” model, though in fact it is an MLE, not IV. Here, I will use 4 variables as instruments: hhkids = dummy variable for kids in the household, whitec = dummy variable for holding a white collar job, handdum = dummy variable for an individual who is handicapped, univ = 1 if the individual holds a university degree. Compute the full information MLE that treats income as endogenous.
(1) The estimate of RHO can be used to test the endogeneity hypothesis. If RHO = 0, we conclude that income is exogenous. What do you find?
(2) What is the partial effect of income in this extended model?
(3) If you disliked the probit model, you might want to use a “linear model,” and use 2SLS here instead. Compute the 2SLS estimates and report the partial (“causal”) effect of income on health status.
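The mechanics of 2SLS in item (3) can be sketched on simulated data. The Python example below is a deliberately simplified stand-in: it uses one instrument and a continuous outcome rather than the lab's four instruments and binary health status, and every variable and parameter value is made up. The true slope is set to 2 so the two estimators can be compared against it.

```python
import random

def ols(x, y):
    """Intercept and slope from a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(12345)
n = 20000
z = [random.gauss(0, 1) for _ in range(n)]   # instrument
v = [random.gauss(0, 1) for _ in range(n)]   # unobservable driving x and the error
x = [zi + vi for zi, vi in zip(z, v)]        # endogenous regressor
u = [0.8 * vi + 0.6 * random.gauss(0, 1) for vi in v]
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]

# OLS ignores the correlation between x and u.
_, b_ols = ols(x, y)

# First stage: regress x on z and form fitted values.
a1, b1 = ols(z, x)
x_hat = [a1 + b1 * zi for zi in z]

# Second stage: regress y on the fitted values.
_, b_2sls = ols(x_hat, y)

print(round(b_ols, 3), round(b_2sls, 3))
```

In this design OLS is pulled away from 2 while the two-stage estimate stays close to it. Standard errors taken from the second stage regression as written would not be correct; packaged 2SLS routines adjust them.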

II. Panel Data

A. Robust Covariance Matrix

Two forms of ‘corrected’ covariance matrices are now standard in the literature, a ‘robust’ covariance matrix that would be the White estimator for heteroscedasticity in the linear model, and a ‘cluster corrected’ covariance matrix that corrects for grouping. We consider these here.
(1) Compute the pooled estimator.
(2) Redo (1) but compute the HAC estimator.
(3) Redo (1) but ‘cluster’ on the panel grouping in the data.
Examine all three results. Does (2) change (1) much? Does (3)?

B. The Incidental Parameters (IP) Problem

The IP problem is a widely observed and (I contend) not well understood effect that appears when fixed effects models are fit by maximum likelihood. It is a persistent bias in the coefficient estimators. With respect to the logit model, there is a consistent estimator available, the ‘Chamberlain estimator,’ that is conditional on a sufficient statistic (the sum of the outcomes). There is also an unconditional estimator that amounts to just computing the dummy variable coefficients along with the other coefficients. Two things are known for sure about the logit model: (1) When T=2, the bias is 100%; the MLE estimates 2β. (2) The bias diminishes as T grows. Here, using our unbalanced panel, we will examine the IP problem.
(A) Using data for which T = 2, compute the two estimators and compare them.
(B) Repeat the exercise with T = 7. Has the (apparent) bias diminished?
(C) Repeat the exercise with T = 3. What is the pattern you are seeing?
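For T = 2 the conditional estimator has a simple form: only individuals whose outcome changes contribute, and the probability of the sequence (0,1) given one success is Λ(β(x2 − x1)). The Python sketch below simulates a panel with a single regressor and a true β of 1, then computes the conditional estimate by Newton's method. The data generating design and function names are assumptions for illustration. The unconditional dummy variable estimator is not computed here.

```python
import math
import random

def logistic_draw(rng):
    """One draw from the standard logistic distribution."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return math.log(u / (1.0 - u))

def simulate_switchers(n, beta, rng):
    """Return (d, dx) for individuals with y1 + y2 = 1, where d = y2."""
    pairs = []
    for _ in range(n):
        alpha = rng.gauss(0, 1)                          # fixed effect
        x = [alpha + rng.gauss(0, 1) for _ in range(2)]  # correlated with alpha
        y = [1 if alpha + beta * xt + logistic_draw(rng) > 0 else 0 for xt in x]
        if y[0] + y[1] == 1:
            pairs.append((y[1], x[1] - x[0]))
    return pairs

def conditional_logit_t2(pairs, iterations=25):
    """Newton iterations for a logit of d on dx with no constant term."""
    b = 0.0
    for _ in range(iterations):
        grad = 0.0
        hess = 0.0
        for d, dx in pairs:
            p = 1.0 / (1.0 + math.exp(-b * dx))
            grad += (d - p) * dx
            hess -= p * (1.0 - p) * dx * dx
        b -= grad / hess
    return b

rng = random.Random(2015)
pairs = simulate_switchers(4000, 1.0, rng)
b_conditional = conditional_logit_t2(pairs)
print(len(pairs), round(b_conditional, 3))
```

The fixed effect drops out of the conditional probability, which is why the estimate does not need the dummy variable coefficients.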

C. Random Effects

A random effects model is an alternative to the problematic FEM. The REM assumes that the common effects are uncorrelated with the regressors, a possibly strong assumption. But, it has the virtue of being consistently estimated. Compute the parameters of the REM and compare the results to the FEM. Notice that there is an LM statistic for testing the hypothesis of no random effects presented with the REM results. What is the result of the test?

D. Mundlak Approach

Mundlak’s approach to the FEM is often used as a compromise between the FEM and the REM. In practical terms, Mundlak involves adding the group means of the time-varying variables to the base model and treating the augmented model as a random effects model. A (loose) test of RE vs. FE in this setting involves testing the joint hypothesis that the coefficients on the group means are all zero. Fit the model and carry out the test. What did you find?
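Constructing the group means in an unbalanced panel is the only data step the Mundlak approach needs. A minimal Python sketch, with made-up person identifiers and income values and a hypothetical helper named add_group_means:

```python
from collections import defaultdict

def add_group_means(ids, values):
    """Return, for each observation, the mean of values within its group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, v in zip(ids, values):
        sums[i] += v
        counts[i] += 1
    return [sums[i] / counts[i] for i in ids]

# Hypothetical unbalanced panel: person 1 has 3 periods, 2 has 2, 3 has 1.
ids = [1, 1, 1, 2, 2, 3]
income = [0.30, 0.40, 0.50, 0.20, 0.80, 0.35]

# The group mean is repeated on every row of the group.
income_bar = add_group_means(ids, income)
print([round(m, 2) for m in income_bar])
```

The new column is then added to the regressors alongside the original variable, one group mean for each time-varying variable in the model.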

E. Simulation Based Estimation

The REM can be viewed as a binary choice model with a random, normally distributed constant term, and estimated using simulation by treating it as a random parameters model (with a random constant term). Fit the model by MSL and compare the results to the Butler and Moffitt estimator in part D. They should be the same except for some random chatter. Does that appear to be the case?
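The simulated likelihood replaces the integral over the random constant with an average over draws. The Python sketch below evaluates one individual's contribution for a random effects probit, using hypothetical data, parameters, and 200 pseudo-random draws. It shows the function being maximized, not the maximization itself, and the function name is introduced for this example.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulated_loglik_i(y, x, beta, sigma, draws):
    """Simulated log likelihood contribution for one individual.

    y: list of 0/1 outcomes over periods; x: list of regressor lists;
    draws: standard normal draws standing in for the random effect.
    """
    total = 0.0
    for u in draws:
        prod = 1.0
        for yt, xt in zip(y, x):
            index = sum(b * v for b, v in zip(beta, xt)) + sigma * u
            q = 2 * yt - 1          # +1 for y = 1, -1 for y = 0
            prod *= normal_cdf(q * index)
        total += prod
    return math.log(total / len(draws))

# Hypothetical individual with three periods; first regressor is the constant.
y = [1, 0, 1]
x = [[1.0, 0.3], [1.0, 0.5], [1.0, 0.2]]
beta = [0.2, 0.5]

rng = random.Random(2015)
draws = [rng.gauss(0, 1) for _ in range(200)]

ll_random = simulated_loglik_i(y, x, beta, 1.0, draws)
ll_pooled = simulated_loglik_i(y, x, beta, 0.0, draws)
print(round(ll_random, 4), round(ll_pooled, 4))
```

With sigma set to zero the draws have no effect and the function collapses to the pooled probit log likelihood, which gives a simple check. The same draws must be reused for an individual at every iteration of the optimizer.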
