
Lecture 1: Models for Binary Outcomes

Luc Behaghel

PSE

January 2009

Luc Behaghel (PSE) Binary outcomes January 2009 1 / 41

Limited dependent variables

Variables that are "naturally" limited:

- Binary outcomes: e.g. married / not married; employed / not employed; smoker / non-smoker;
- Other discrete outcomes: e.g. number of years of schooling; number of new hires in a firm in a given month; very satisfied / rather satisfied / unsatisfied; preferred transportation (bus / car / train);
- Non-negative variables: e.g. income; number of hours worked.

Variables recorded as limited although the underlying outcome is not:

- Discretized variables: e.g. income recorded in brackets;
- Censored variables: e.g. top-coded income.

⇒ Frequent! Are all dependent variables limited?


Binary outcomes

Most extreme case of limited dependent variables: y can take only two values, usually denoted 0 and 1:

$$y = \begin{cases} 0 & \text{if unemployed} \\ 1 & \text{if employed} \end{cases}$$

$$y = \begin{cases} 0 & \text{if high-school dropout} \\ 1 & \text{if high-school graduate} \end{cases}$$

⇒ Linear models (OLS, 2SLS) inappropriate?


Outline

1 ask what the issue is with linear models;

2 present models that have become standard in such contexts (logit and probit);

3 explain how these statistical models can (sometimes) be derived from a theoretical framework;

4 introduce a new estimation method used for these nonlinear models: (conditional) maximum likelihood estimation (CMLE);

5 discuss how estimates are to be interpreted.


A specification issue
Reminder on Conditional Expectation Functions

CEFs are a tool to answer "factual questions" and "conditional counterfactual questions".

E.g. $E(y \mid T, C)$ with C: "controls"; T: "treatment", "variable of interest". Two types of parameters of interest:

T is discrete: discrete effect of going from $T_0$ to $T_1$:

$$E(y \mid T = T_1, C = C_0) - E(y \mid T = T_0, C = C_0).$$

T is continuous: marginal effect:

$$\Delta_{T_0 \mid C_0} \equiv \frac{\partial E(y \mid T = T_0, C = C_0)}{\partial T}.$$


We don't know the functional form g in $E(y \mid T, C) \equiv g(T, C)$. Hard to estimate if T and C can take many values.

⇒ parametric "assumption" (rather: linear approximation of the CEF):

$$E(y \mid T, C) = C\beta + \alpha T.$$

Rem.

1 Flexible approximation: g(.) need not be linear in T and C. E.g. $E(y \mid T, C) = C\beta + \alpha_1 T + \alpha_2 T^2$.

2 The marginal effect can depend on where the effect is evaluated. E.g. $\frac{\partial E(y \mid T = T_0, C = C_0)}{\partial T} = \alpha_1 + 2\alpha_2 T_0$.
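The dependence of the marginal effect on the evaluation point can be checked numerically. The sketch below is only an illustration: the coefficient values are hypothetical, and the controls term $C\beta$ is set to zero because it does not affect the derivative with respect to T.

```python
# Hypothetical coefficients, chosen only for illustration (not estimates).
ALPHA_1, ALPHA_2 = 0.08, -0.001


def cef(t, c_beta=0.0):
    """Quadratic approximation E(y | T, C) = C*beta + alpha_1*T + alpha_2*T^2."""
    return c_beta + ALPHA_1 * t + ALPHA_2 * t ** 2


def marginal_effect(t0):
    """Analytical derivative with respect to T, evaluated at T = t0."""
    return ALPHA_1 + 2 * ALPHA_2 * t0


def numerical_derivative(t0, h=1e-6):
    """Central finite difference, used only as a cross-check."""
    return (cef(t0 + h) - cef(t0 - h)) / (2 * h)


# The same model yields different marginal effects at different values of T.
effects = {t0: marginal_effect(t0) for t0 in (20, 40, 60)}
for t0, effect in effects.items():
    print(f"T0 = {t0}: analytical {effect:+.4f}, numerical {numerical_derivative(t0):+.4f}")
```

With these made-up values the effect is positive at T0 = 20, zero at T0 = 40 and negative at T0 = 60, which is why a single coefficient cannot summarize the effect once T enters nonlinearly.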


CEF when y is binary

1 The CEF is a "CPF" (conditional probability function): a probability that takes values between 0 and 1:

$$E(y \mid x) = 1 \times \Pr(y = 1 \mid x) + 0 \times \Pr(y = 0 \mid x) = \Pr(y = 1 \mid x).$$

⇒ why not use this a priori information?

2 The predicted probabilities in a linear model can take values outside of [0; 1] ⇒ lack of internal consistency.

So is a linear probability model $\Pr(y = 1 \mid x) = x\beta$ still useful? Angrist and Pischke (2009): yes; Wooldridge (2002): not the best solution. It depends...


Two examples

1 A randomized experiment with binary treatment and binary outcome. Unemployed workers are randomly allocated to a program that provides intensive counseling (T = 1) or to the standard track, with less intensive interventions (T = 0). The outcome is the employment status after 6 months: employed (y = 1) or not (y = 0).

2 The relationship between a continuous variable (age) and women's labor supply. Women's labor supply is a classic example of a binary variable. Here: a simple descriptive analysis based on the French Labor Force Survey (Enquête emploi). Exercise: replicate Angrist and Evans (1998) on the impact of the number of children on female labor force participation.



Specification (continued)
Example 1: Saturated case

Both the independent variable of interest (T) and the outcome (y) are binary. The data can be summarized in a simple two-way table:

        y = 0   y = 1
T = 0   150     150
T = 1   100     200

The CEF is extremely simple:

$$E(y \mid T) = \begin{cases} E(y \mid T = 1) & \text{for the treated} \\ E(y \mid T = 0) & \text{for the controls} \end{cases}$$

Therefore, we can write, without making any restriction:

$$E(y \mid T) = E(y \mid T = 0) + (E(y \mid T = 1) - E(y \mid T = 0)) \times T \equiv \alpha_0 + \alpha_1 T.$$

⇒ "linear probability model" (LPM):

$$E(y \mid T) = \Pr(y = 1 \mid T) = \alpha_0 + \alpha_1 T.$$

What is your estimate of $\alpha_1$?
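One way to check an answer is to compute the two cell frequencies from the table and compare them with the OLS coefficients of y on T. The sketch below uses only the four counts given in the table; the variable names are illustrative.

```python
# Cell counts from the two-way table: counts[T][y]
counts = {0: {0: 150, 1: 150}, 1: {0: 100, 1: 200}}


def employment_rate(t):
    """Sample analogue of E(y | T = t) = Pr(y = 1 | T = t)."""
    return counts[t][1] / (counts[t][0] + counts[t][1])


# Saturated-model parameters read directly off the table.
alpha_0 = employment_rate(0)
alpha_1 = employment_rate(1) - employment_rate(0)

# Cross-check: OLS of y on T using the expanded individual-level data.
data = [(t, y) for t in counts for y in counts[t] for _ in range(counts[t][y])]
n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_y = sum(y for _, y in data) / n
cov_ty = sum((t - mean_t) * (y - mean_y) for t, y in data) / n
var_t = sum((t - mean_t) ** 2 for t, _ in data) / n
ols_slope = cov_ty / var_t
ols_intercept = mean_y - ols_slope * mean_t

print(f"alpha_0 = {alpha_0:.4f}, alpha_1 = {alpha_1:.4f}")
print(f"OLS intercept = {ols_intercept:.4f}, OLS slope = {ols_slope:.4f}")
```

Because the model is saturated, the OLS intercept and slope coincide with the control-group employment rate and the difference in employment rates.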


Estimation of LPMs

Linear model ⇒ OLS or 2SLS.

Heteroskedasticity. Pay attention to the standard errors: they need to be robust to heteroskedasticity, as the structure of the model implies heteroskedasticity. You can check it in the example:

$$\mathrm{Var}(u \mid T) = E(u^2 \mid T) - (E(u \mid T))^2 = \dots = \Pr(y = 1 \mid T)(1 - \Pr(y = 1 \mid T)).$$

The variance of the residual depends on T.

⇒ In practice, use the robust standard errors computed by standard packages (e.g., in Stata, just add the option ", robust").
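The heteroskedasticity can be verified numerically on the example. In the sketch below, the employment rates come from the two-way table of Example 1; the residual u = y - Pr(y = 1 | T) takes only two values, so its conditional moments can be written out directly.

```python
# Pr(y = 1 | T) from the two-way table of Example 1.
employment_rate = {0: 150 / 300, 1: 200 / 300}


def residual_variance(t):
    """Var(u | T = t), where u = y - Pr(y = 1 | T = t)."""
    p = employment_rate[t]
    # u equals 1 - p with probability p, and -p with probability 1 - p.
    second_moment = p * (1 - p) ** 2 + (1 - p) * p ** 2
    mean = p * (1 - p) + (1 - p) * (-p)
    return second_moment - mean ** 2


var_controls = residual_variance(0)
var_treated = residual_variance(1)
print(f"Var(u | T = 0) = {var_controls:.4f}")
print(f"Var(u | T = 1) = {var_treated:.4f}")
```

The two variances differ (0.25 for the controls, 2/9 for the treated), and each equals p(1 - p) for the corresponding group.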


A note on saturated models

$$E(y \mid T) = E(y \mid T = 0) + (E(y \mid T = 1) - E(y \mid T = 0)) \times T \equiv \alpha_0 + \alpha_1 T$$

is an example of a "saturated model". It is saturated in the sense that we introduce as many parameters to estimate as there are distinct values that the CEF can take. As a consequence, the model does not approximate the CEF; it describes it fully. Of course, this is feasible only when the CEF can take a finite (and not too large) number of values. Here, the CEF takes only two values.

Question: what would be the "saturated model" in a slightly more complicated experiment, where there are two treatments that are randomly and independently allocated: (i) counseling (T = 0 or 1); (ii) eligibility for a bonus payment if the worker finds a job within six months (B = 1 if eligible, 0 otherwise)? Write the model so that we can assume $E(u \mid T, B) = 0$ without actually imposing any constraint.


Specification (continued)
Example 2: A continuous regressor

[Figure: Employment status according to age. Non parametric estimation using locally weighted regressions (bandwidth = .05). Vertical axis: employment (non-employed / employed); horizontal axis: age (20 to 80).]


$$\Pr(y = 1 \mid x) = x\beta$$

[Figure: Fit using a linear probability model. Series: nonparametric estimation; OLS on age; OLS on age, age^2; OLS on age, age^2, age^3. Vertical axis: non-employed / employed; horizontal axis: age (20 to 80).]


$$\Pr(y = 1 \mid x) = \Phi(x\beta)$$

[Figure: Fit using a Probit model. Series: nonparametric estimation; Probit on age; Probit on age, age^2; Probit on age, age^2, age^3. Vertical axis: non-employed / employed; horizontal axis: age (20 to 80).]


$$\Pr(y = 1 \mid x) = \Lambda(x\beta) \equiv \frac{\exp(x\beta)}{1 + \exp(x\beta)}$$

[Figure: Fit using a Logit model. Series: nonparametric estimation; Logit on age; Logit on age, age^2; Logit on age, age^2, age^3. Vertical axis: non-employed / employed; horizontal axis: age (20 to 80).]


Index. LPM, Logit and Probit are based on a linear function of x: $x\beta$, called the index. Probit and Logit use nonlinear transformations ($\Phi$ or $\Lambda$) to ensure that the probability takes its values between 0 and 1. Other functions are possible.

Summary. LPM, Logit and Probit can do a good job of summarizing the CEF, as long as they are used in a flexible way. Logit and Probit constrain probabilities to be between 0 and 1. This is an advantage over LPM when the model is not saturated, as it provides internal consistency.
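The role of the transformation can be seen by feeding the same index values to the three models. This is a minimal sketch: the index values are arbitrary, and the function names `lpm`, `probit` and `logit` are illustrative labels for the identity, $\Phi$ and $\Lambda$.

```python
import math


def lpm(index):
    """Linear probability model: the index itself is the predicted probability."""
    return index


def probit(index):
    """Standard normal CDF, Phi, written with the error function."""
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))


def logit(index):
    """Logistic CDF, Lambda(u) = exp(u) / (1 + exp(u))."""
    return 1.0 / (1.0 + math.exp(-index))


# Arbitrary index values, including some outside [0, 1].
for index in (-3.0, 0.0, 0.5, 3.0):
    print(f"index = {index:+.1f}: LPM {lpm(index):+.3f}, "
          f"Probit {probit(index):.3f}, Logit {logit(index):.3f}")
```

For an index of -3 or 3 the LPM "probability" falls outside [0, 1], while the Probit and Logit values stay strictly inside it.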


Latent model interpretation of Logit and Probit

Classic example: the labor supply of women

$$\begin{cases} \max u(c, 1 - l) \\ \text{s.t. } c = R + wl \\ l \geq 0 \end{cases}$$

(l: labor, R: other sources of income, c: consumption, 1 - l: leisure).

Lagrangian (with multiplier $\mu$):

$$L = u(R + wl, 1 - l) + \mu l.$$

FOC:

$$\begin{cases} w u_c - u_l + \mu = 0 \\ \mu l = 0 \end{cases}$$


Two cases:

1 If $\mu > 0$, then $l = 0$ and $\frac{u_c}{u_l}(R, 1) < \frac{1}{w}$;

2 If $\mu = 0$, then $l > 0$ and $\frac{u_c}{u_l}(R + wl, 1 - l) = \frac{1}{w}$.

Interpretation: two groups of women.

1 $\frac{u_c}{u_l}(R, 1) < \frac{1}{w}$: starting from a situation with 0 hours worked, the marginal benefit of working, $w u_c$ (evaluated at (R, 1)), is lower than the marginal benefit of taking another hour of leisure, $u_l$.

2 $\frac{u_c}{u_l}(R, 1) > \frac{1}{w}$: these women have a net benefit of working a first hour. They increase their hours until reaching $l^*$ such that $\frac{u_c}{u_l}(R + wl^*, 1 - l^*) = \frac{1}{w}$.

⇒ economic model for employment indicator y:

$$y = \begin{cases} 0 & \text{if } \frac{u_c}{u_l}(R, 1) - \frac{1}{w} \leq 0 \\ 1 & \text{if } \frac{u_c}{u_l}(R, 1) - \frac{1}{w} > 0 \end{cases} \qquad (1)$$


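From the economic to the statistical model

Assume

1 Statistical model for the MRS and the wages:

$$\frac{u_c}{u_l}(R, 1) = \alpha R + Z\gamma + \varepsilon$$

$$\frac{1}{w} = W\delta + \eta$$

(Z and W: observed determinants of MRS and wages; $\varepsilon$ and $\eta$: unobserved determinants). Define $u \equiv \varepsilon - \eta$. Substituting into (1):

$$y = \begin{cases} 0 & \text{if } x\beta + u \leq 0 \\ 1 & \text{if } x\beta + u > 0 \end{cases}$$

(with $x\beta \equiv \alpha R + Z\gamma - W\delta$).

2 Distributional assumption for u given x:

$$u \mid x \sim N(0, 1)$$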


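⇒ Statistical model:

$$y = \begin{cases} 0 & \text{if } x\beta + u \leq 0 \\ 1 & \text{if } x\beta + u > 0 \end{cases} \qquad u \mid x \sim N(0, 1)$$

⇒ conditional probability function:

$$\Pr(y = 1 \mid x) = \Pr(-u < x\beta) = \Phi(x\beta)$$

= Probit model.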
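The step from the threshold-crossing model to $\Phi(x\beta)$ can be illustrated by simulation. The sketch below draws standard normal errors, applies the rule "y = 1 if the index plus the error is positive", and compares the simulated share of ones with the standard normal CDF. The index value 0.5, the number of draws and the seed are arbitrary choices made for the illustration.

```python
import random
from statistics import NormalDist


def simulate_employment_share(index, n_draws=200_000, seed=0):
    """Share of draws with index + u > 0, where u ~ N(0, 1)."""
    rng = random.Random(seed)
    employed = sum(1 for _ in range(n_draws) if index + rng.gauss(0.0, 1.0) > 0)
    return employed / n_draws


index = 0.5  # hypothetical value of x*beta
simulated = simulate_employment_share(index)
theoretical = NormalDist().cdf(index)  # Phi(x*beta)

print(f"simulated Pr(y = 1 | x) = {simulated:.4f}")
print(f"Phi(x*beta)             = {theoretical:.4f}")
```

The two numbers should agree up to simulation noise, which shrinks as the number of draws grows.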


Note 1: Structural approach ≠ descriptive approach.

(theoretical model ⇒ statistical specification) vs. (data ⇒ statistical specification).

Two types of assumptions to derive the statistical model:

- parametric assumptions on functional forms or distributions (linearity, normality);
- more fundamental independence assumptions on the correlations between unobserved and observed variables (independence).

⇒ a good model is robust to the choice of alternative parametric assumptions (as these choices are arbitrary) and has a good economic rationale to justify the independence assumptions.


Note 2: Structural and reduced form parameters

Ideally, you don't write a theoretical model whose only job is to justify your statistical specification. The theoretical model should really guide you on what variables to introduce, and whether or not they are likely to be correlated with the error term; moreover, one would like to be able to go back from the empirical model to the structural parameters of the theoretical model. One such parameter, for instance, would be the elasticity of substitution between leisure and consumption. Here, we have not sufficiently developed the model for that, and the connection between the theoretical and the statistical model is loose. All we get in $\beta$ are reduced-form parameters (i.e. they don't have a direct economic interpretation).


"Latent" model

General latent (index) model:

$$y^* = x\beta + u$$

with

$$\begin{cases} y = 1 & \text{if } y^* > 0 \\ y = 0 & \text{if } y^* \leq 0 \end{cases}$$

y, not $y^*$, is observed. $y^*$ is a "latent" variable we use to interpret what we observe. When $y^*$ increases, the probability of observing y = 1 increases, and vice versa. So $y^*$ can generally be interpreted as a propensity to have outcome y = 1.

Our example: y = 1 if $w u_c - u_l > 0$; $y^* = w u_c - u_l$.

⇒ $y^*$ is the marginal net benefit of increasing labor supply from 0 to 1 hour.


Maximum likelihood estimation

Assumptions

We observe a random sample $((y_1, x_1), (y_2, x_2), \dots, (y_n, x_n))$.

The sample is drawn from a distribution with joint density f(y, x).

This density is known up to some finite-dimensional parameter, $\beta$.

The maximum likelihood estimator of $\beta$ is the value of $\beta$ that maximizes the probability that the sample $((y_1, x_1), (y_2, x_2), \dots, (y_n, x_n))$ was observed.


Example 1: y is drawn from a binomial (Bernoulli) distribution: 1 with probability p and 0 with probability 1 - p. Here, there are no x's and the only unknown parameter is p. We want to estimate p by ML based on a sample where the frequency of 1's is $\bar y = 0.4$. The likelihood is the probability of obtaining this sample, as a function of p:

$$L(y_1, \dots, y_n) = \prod_{i=1}^{n} \Pr(y = y_i) = p^{0.4n}(1 - p)^{0.6n}$$

(the first equality uses the fact that the draws are independent). $\hat p_{MLE}$ is the value of p that maximizes $p^{0.4n}(1 - p)^{0.6n}$. It is often easier to maximize $\log L(y_1, \dots, y_n)$. Check that you get

$$\hat p = 0.4 = \bar y.$$

The ML estimator of p is simply the empirical frequency of 1's in the sample.
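The result can be checked with a brute-force search over p. The sketch below assumes a sample of n = 10 observations with four 1's (any n with 40% of 1's gives the same maximizer) and evaluates the log-likelihood on a fine grid; the grid search is a stand-in for the analytical first-order condition, not a recommended estimation method.

```python
import math

# Assumed sample: n observations, 40% of which are 1's.
n, n_ones = 10, 4


def log_likelihood(p):
    """log L = 0.4n * log(p) + 0.6n * log(1 - p) for the Bernoulli sample."""
    return n_ones * math.log(p) + (n - n_ones) * math.log(1 - p)


# Grid over the open interval (0, 1), excluding the endpoints where log is undefined.
grid = [k / 1000 for k in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

print(f"grid maximizer of the log-likelihood: {p_hat}")
```

The maximizer found on the grid coincides with the sample frequency of 1's, 0.4.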


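Unconditional likelihood of a random sample $((y_1, x_1), (y_2, x_2), \dots, (y_n, x_n))$:

$$\log L(y, x \mid \gamma) = \sum_i \log f(y_i, x_i \mid \gamma).$$

⇒ requires specifying the joint density f.

Conditional likelihood. Model only how y depends on x:

$$\begin{aligned}
\log L(y, x \mid \gamma) &= \sum_i \log f(y_i, x_i \mid \gamma) \\
&= \sum_i \log\left[f(y_i \mid x_i, \gamma) f(x_i \mid \gamma)\right] \\
&= \sum_i \log f(y_i \mid x_i, \gamma) + \sum_i \log f(x_i \mid \gamma) \\
&= \log L(y \mid x, \gamma) + \sum_i \log f(x_i \mid \gamma).
\end{aligned}$$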


$$\log L(y, x \mid \gamma) = \log L(y \mid x, \gamma) + \sum_i \log f(x_i \mid \gamma)$$

$\log L(y \mid x, \gamma)$ = conditional log-likelihood.

Assumption: x is exogenous (it is determined independently of y) ⇒ its distribution depends on a subset of the $\gamma$ parameters, $\delta$, and the conditional distribution of $y \mid x$ depends on another subset, $\beta$. Then

$$\log L(y, x \mid \gamma) = \log L(y \mid x, \beta) + \sum_i \log f(x_i \mid \delta).$$

To estimate $\beta$, maximizing $\log L(y \mid x, \beta)$ is sufficient. This requires only specifying the conditional density of y given x.


Example 2: In a Probit model, this is exactly what we have specified: the conditional distribution of y given x. y can take only two values, with probabilities:

$$\begin{cases} \Pr(y = 1 \mid x) = \Phi(x\beta) \\ \Pr(y = 0 \mid x) = 1 - \Phi(x\beta) \end{cases}$$

The conditional log-likelihood is:

$$\log L(y \mid x, \beta) = \sum_i y_i \log \Phi(x_i\beta) + \sum_i (1 - y_i) \log\left[1 - \Phi(x_i\beta)\right].$$

Example 3: In the case of a continuous variable y, assume that

$$y \mid x \sim N(x\beta, 1)$$

What does it imply for the CEF $E(y \mid x)$? Would OLS provide a valid estimate of $\beta$? Show that maximizing the log-likelihood gives the same result as applying OLS.
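The Probit log-likelihood of Example 2 is straightforward to code. The sketch below evaluates it on the two-way table of the counseling experiment, with an index of the assumed form b0 + b1*T, and compares two candidate parameter vectors. The second candidate is chosen so that the fitted probabilities match the table frequencies; it is used here only to show that the log-likelihood ranks it above the all-zero vector.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

# Grouped data from the two-way table of Example 1: (T, y, number of workers).
cells = [(0, 0, 150), (0, 1, 150), (1, 0, 100), (1, 1, 200)]


def probit_log_likelihood(b0, b1):
    """Conditional log-likelihood with index b0 + b1*T."""
    total = 0.0
    for t, y, count in cells:
        p = PHI(b0 + b1 * t)
        total += count * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


# Candidate 1: both coefficients equal to zero (all probabilities equal to 0.5).
ll_null = probit_log_likelihood(0.0, 0.0)

# Candidate 2: Phi(b0) = 150/300 and Phi(b0 + b1) = 200/300.
b1_fit = NormalDist().inv_cdf(2 / 3)
ll_fit = probit_log_likelihood(0.0, b1_fit)

print(f"log L at (0, 0)          = {ll_null:.3f}")
print(f"log L at (0, {b1_fit:.4f})     = {ll_fit:.3f}")
```

Working with grouped counts rather than 600 individual rows is a convenience: identical observations contribute identical terms to the sum.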


Maximizing the log-likelihood in practice

Usually no closed-form solution ⇒ algorithms that try different values of the vector $\beta$ in order to find one such that

$$\frac{\partial \log L(y \mid x, \beta)}{\partial \beta} = 0.$$

= first-order conditions (FOC).

There are also second-order conditions (SOC): the matrix of second derivatives must be negative definite (think of the case where $\beta$ is a scalar). Algorithms use the first derivatives (a vector called the score, or gradient) and the second derivatives (a matrix called the Hessian) to arrive at a point where the FOC and the SOC are met.


Most of the time, things go well and we don't care too much about how the algorithm actually works (we type the "probit" command in Stata, and let Stata maximize the log-likelihood with its favorite algorithm). But it is important to be aware of potential problems. In particular, there may be several local maxima, and the algorithm might get stuck at a local maximum that is not the global maximum. The algorithm has no such problem when the Hessian is negative definite everywhere. In that case, there is at most one local maximum, which is then the global maximum. (In the scalar case, this means that the log-likelihood is strictly concave.) Therefore, ideally, when one uses ML, one would like to check whether the log-likelihood has the right properties.


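The Logit is a relatively easy case in which to show that the (conditional) log-likelihood has the right properties.

$$\log L(y \mid x, \beta) = \sum_i y_i \log \Lambda(x_i\beta) + \sum_i (1 - y_i) \log\left[1 - \Lambda(x_i\beta)\right],$$

with $\Lambda(u) = e^u/(1 + e^u)$.

We want to compute the gradient

$$g(\beta) = \frac{\partial \log L(y \mid x, \beta)}{\partial \beta}$$

and the Hessian

$$H(\beta) = \frac{\partial g(\beta)}{\partial \beta'}.$$

The computations are made easier by the fact that

$$\frac{\partial \Lambda(u)}{\partial u} = \frac{e^u(1 + e^u) - e^u e^u}{(1 + e^u)^2} = \frac{e^u}{(1 + e^u)^2} = \Lambda(u)(1 - \Lambda(u)).$$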


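Then:

$$g(\beta) = \sum_i \left[ y_i \frac{\Lambda(x_i\beta)(1 - \Lambda(x_i\beta))}{\Lambda(x_i\beta)} - (1 - y_i) \frac{\Lambda(x_i\beta)(1 - \Lambda(x_i\beta))}{1 - \Lambda(x_i\beta)} \right] x_i' = \sum_i \left[y_i - \Lambda(x_i\beta)\right] x_i'$$

and

$$H(\beta) = -\sum_i \Lambda(x_i\beta)(1 - \Lambda(x_i\beta)) x_i' x_i.$$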
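These two formulas are all that a Newton-Raphson iteration needs. The sketch below is one possible implementation for the special case of an intercept and a single binary regressor, applied to the two-way table of the counseling experiment. It is an illustration under those assumptions, not the algorithm used by any particular package; the 2x2 Hessian is inverted by hand to keep the example free of external libraries.

```python
import math

# Grouped data from the two-way table of Example 1: (T, y, number of workers).
cells = [(0, 0, 150), (0, 1, 150), (1, 0, 100), (1, 1, 200)]


def logistic(u):
    """Lambda(u) = exp(u) / (1 + exp(u))."""
    return 1.0 / (1.0 + math.exp(-u))


def gradient_and_hessian(b0, b1):
    """Score g(beta) and Hessian H(beta) for x_i = (1, T_i)."""
    g0 = g1 = 0.0
    h00 = h01 = h11 = 0.0
    for t, y, count in cells:
        p = logistic(b0 + b1 * t)
        residual = y - p          # y_i - Lambda(x_i beta)
        weight = p * (1.0 - p)    # Lambda(x_i beta) * (1 - Lambda(x_i beta))
        g0 += count * residual
        g1 += count * residual * t
        h00 -= count * weight
        h01 -= count * weight * t
        h11 -= count * weight * t * t
    return (g0, g1), (h00, h01, h11)


def newton_logit(b0=0.0, b1=0.0, tol=1e-10, max_iter=50):
    """Iterate beta <- beta - H^{-1} g until the step is negligible."""
    for _ in range(max_iter):
        (g0, g1), (h00, h01, h11) = gradient_and_hessian(b0, b1)
        det = h00 * h11 - h01 * h01
        # H^{-1} g for a symmetric 2x2 matrix, written out explicitly.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (-h01 * g0 + h00 * g1) / det
        b0, b1 = b0 - step0, b1 - step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


b0_hat, b1_hat = newton_logit()
print(f"b0_hat = {b0_hat:.6f}, b1_hat = {b1_hat:.6f}")
print(f"fitted Pr(y = 1 | T = 1) = {logistic(b0_hat + b1_hat):.4f}")
```

Because the Hessian is minus a weighted sum of terms $x_i' x_i$ with positive weights, it is negative semi-definite everywhere (negative definite when the regressors are not collinear), so the starting value matters little here. In this saturated example the fitted probabilities reproduce the table frequencies.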


Asymptotic properties of CMLE

Under some regularity conditions (OK for Probit and Logit):

1 Consistency: As sample size grows, the estimate comes closer and closer to the true value:

$$\operatorname{plim} \hat\beta_{ML} = \beta_o.$$

2 Asymptotic normality: The distribution of $\hat\beta_{ML}$ tends toward a normal:

$$\sqrt{n}(\hat\beta_{ML} - \beta_o) \to N(0, V)$$

with $V = -E[H(\beta_o)]^{-1}$ (H being understood here as the Hessian of one observation's log-likelihood).

3 Asymptotic efficiency: $\hat\beta_{ML}$ is asymptotically efficient (it has the smallest asymptotic variance).

Note 1: These properties assume that we have properly specified the conditional density. However, the estimator is robust to minor specification errors (if the CEF is well specified).

Note 2: Software packages directly give an estimate of the asymptotic variance, $\widehat{\mathrm{Avar}}(\hat\beta_{ML}) = \frac{1}{n}\hat V$. It is possible to ask for robust variances, for instance to account for correlations across observations.

Interpreting the results of binary models

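The coefficients in the index are not directly interpretable. Key: estimate marginal effects:

$$\Delta_{T_0 \mid C_0} = \frac{\partial \Pr(y = 1 \mid T = T_0, C = C_0)}{\partial T} = \begin{cases}
\frac{\partial (C\beta_{LP} + \alpha_{LP} T)}{\partial T}(C_0, T_0) = \alpha_{LP} & \text{(LPM)} \\
\frac{\partial \Phi(C\beta_P + \alpha_P T)}{\partial T}(C_0, T_0) = \alpha_P \phi(C_0\beta_P + \alpha_P T_0) & \text{(Probit)} \\
\frac{\partial \Lambda(C\beta_L + \alpha_L T)}{\partial T}(C_0, T_0) = \alpha_L \Lambda(C_0\beta_L + \alpha_L T_0)(1 - \Lambda(C_0\beta_L + \alpha_L T_0)) & \text{(Logit)}
\end{cases}$$

Rem: The marginal effects have the same sign as the coefficients. In the LPM, the coefficient is itself the approximation to the marginal effect. In the Probit and Logit cases, the coefficient must be multiplied by a quantity that depends on where we are evaluating the effect.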
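The Probit and Logit formulas translate directly into code. In the sketch below, the coefficient `alpha` and the index values are hypothetical; the point is only to show how the multiplicative factor changes with the evaluation point.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def probit_marginal_effect(alpha, index):
    """alpha * phi(index), with phi the standard normal density."""
    return alpha * STD_NORMAL.pdf(index)


def logit_marginal_effect(alpha, index):
    """alpha * Lambda(index) * (1 - Lambda(index))."""
    p = 1.0 / (1.0 + math.exp(-index))
    return alpha * p * (1.0 - p)


alpha = 1.0  # hypothetical index coefficient on T
for index in (0.0, 1.0, 2.0):  # hypothetical values of C0*beta + alpha*T0
    print(f"index = {index:.1f}: Probit {probit_marginal_effect(alpha, index):.4f}, "
          f"Logit {logit_marginal_effect(alpha, index):.4f}")
```

Both factors are largest at an index of zero (a predicted probability of 0.5) and shrink as the index moves away from it, so the same coefficient implies a smaller marginal effect for individuals whose predicted probability is close to 0 or 1.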


Reporting the index coefficients only

Easiest solution. Drawback: the magnitude of a Probit or a Logit coefficient has no direct interpretation. But:

- The sign of the marginal effect is the same as the sign of the index coefficient. So reporting the index coefficient is enough if one is just interested in the direction of the effect (rarely the case in economics!).

- The ratio of two index coefficients is equal to the ratio of the corresponding marginal effects. E.g.

$$\left[\frac{\partial \Phi(C\beta_P + \alpha_P T)}{\partial T}(C_0, T_0)\right] \Big/ \left[\frac{\partial \Phi(C\beta_P + \alpha_P T)}{\partial C}(C_0, T_0)\right] = \frac{\alpha_P}{\beta_P}$$

(and the same for the Logit). $\frac{\alpha_P}{\beta_P} = 2$ would thus mean that the impact of T (on the probability that y = 1) is twice as large as the impact of C. Note that this holds independently of where we measure the marginal effects (that is, which values we choose for $C_0$ and $T_0$).


Reporting marginal effects for a reference individual

Compute marginal effects for a given individual, using the coefficient estimates. Issue: one has to choose a particular reference $(C_0, T_0)$.

A first possibility is to consider a (fictitious) individual for whom Pr(y = 1) = .5. In that case, in the Logit model, the multiplicative term is Pr(y = 1)(1 - Pr(y = 1)) = .25, and the multiplicative term in the Probit model is $\phi(0) \approx .40$. In other words, to transform the coefficients into marginal effects, we can multiply them by .25 (Logit) or .4 (Probit).

Another possibility is to choose the "average individual", i.e. to set $C_0$ and $T_0$ at the sample means. This is what software like Stata does when it computes marginal effects (command "dprobit"). Note that if C or T is a dummy variable, choosing $C_0 = \bar C$ or $T_0 = \bar T$ makes little sense, and it might be preferable to choose $C_0 = 0$ or 1, and the same for $T_0$.


Example: Back to the labor supply of women. Assume that T is an indicator variable for having a child aged less than 3 or not, and C is the woman's age. One could typically compute the marginal effect of age for a woman of average age ($C_0 = \bar C$) with no child under 3 ($T_0 = 0$). Also, for the impact of having a child under three, rather than a marginal effect, one would compute

$$\Pr(y = 1 \mid C = \bar C, T = 1) - \Pr(y = 1 \mid C = \bar C, T = 0).$$

(Again, Stata can do the computation for you, using the appropriate options with the "dprobit" command).
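For a dummy variable, the discrete difference is a one-line computation once Probit estimates are available. In the sketch below every number is hypothetical (the intercept, the two coefficients and the average age are invented for the illustration and are not estimates from the Enquête emploi).

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

# Hypothetical Probit estimates, invented for this illustration.
INTERCEPT, BETA_AGE, ALPHA_CHILD = 0.2, 0.01, -0.6
MEAN_AGE = 40.0  # hypothetical sample mean of age


def employment_probability(age, child_under_3):
    """Pr(y = 1 | C = age, T = child_under_3) under the assumed Probit index."""
    return PHI(INTERCEPT + BETA_AGE * age + ALPHA_CHILD * child_under_3)


# Discrete effect of having a child under 3, for a woman of average age.
discrete_effect = (employment_probability(MEAN_AGE, 1)
                   - employment_probability(MEAN_AGE, 0))

print(f"Pr(y = 1 | age = 40, no child under 3) = {employment_probability(MEAN_AGE, 0):.4f}")
print(f"Pr(y = 1 | age = 40, child under 3)    = {employment_probability(MEAN_AGE, 1):.4f}")
print(f"discrete effect                        = {discrete_effect:+.4f}")
```

The difference in probabilities is used instead of a derivative because T can only move from 0 to 1; a derivative-based effect would describe an infinitesimal change that a dummy cannot undergo.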


Note 1: Marginal effects in Probit and Logit models are complex, nonlinear functions of all the coefficients ⇒ their standard errors are computed from a linear approximation, using the Delta method (a first-order Taylor expansion). This is done by Stata.

Note 2: The formulas for marginal effects are only valid if the explanatory variables enter the index linearly. Counter-example: a polynomial in age in the labor supply equation. The software does not know that a variable is the square of another explanatory variable. So you cannot trust it, in that case, to derive the marginal effects...


Reporting average marginal effects

Why report the marginal effect of age for a woman with no child under 3 rather than for a woman with children under 3?

⇒ usual practice: report the average marginal effect over the sample:

$$\bar\Delta = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \Pr(y = 1 \mid T = T_i, C = C_i)}{\partial T}.$$

Complicated, but computed by computers.
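The averaging itself is simple once the individual marginal effects can be computed. The sketch below does it for the marginal effect of age in a Probit model, on a tiny made-up sample of (age, child under 3) pairs with hypothetical coefficients, and contrasts the result with the effect evaluated at the sample means.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Hypothetical Probit estimates and a made-up sample, for illustration only.
INTERCEPT, BETA_AGE, ALPHA_CHILD = 0.2, 0.01, -0.6
sample = [(25, 1), (30, 0), (35, 1), (45, 0), (55, 0), (60, 0)]  # (age, child under 3)


def marginal_effect_of_age(age, child_under_3):
    """beta_age * phi(index), evaluated at one individual's characteristics."""
    index = INTERCEPT + BETA_AGE * age + ALPHA_CHILD * child_under_3
    return BETA_AGE * STD_NORMAL.pdf(index)


# Average marginal effect: average of the individual marginal effects.
average_marginal_effect = (
    sum(marginal_effect_of_age(age, child) for age, child in sample) / len(sample)
)

# Alternative: marginal effect for the "average individual".
mean_age = sum(age for age, _ in sample) / len(sample)
mean_child = sum(child for _, child in sample) / len(sample)
effect_at_means = marginal_effect_of_age(mean_age, mean_child)

print(f"average marginal effect  = {average_marginal_effect:.6f}")
print(f"marginal effect at means = {effect_at_means:.6f}")
```

The two numbers generally differ because the density is a nonlinear function of the index: averaging the effects is not the same as evaluating the effect at the average characteristics.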


Test of coefficient significance ⇒ t-statistic (same as OLS):

$$\frac{\hat\beta_{ML}}{\hat\sigma(\hat\beta_{ML})} \to N(0, 1)$$

under the null hypothesis that $\beta = 0$.

Test of multiple restrictions ⇒ frequently used approach: "likelihood ratio" test. Estimate the unrestricted and the restricted model (applying the constraints). Log-likelihoods $\log L_{ur}$ and $\log L_r$, respectively. Likelihood ratio statistic:

$$LR = 2(\log L_{ur} - \log L_r).$$

$LR \to \chi^2(q)$ under the null hypothesis (q: number of constraints).
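As an illustration, the likelihood ratio test can be carried out by hand on the two-way table of the counseling experiment, testing the single restriction that the treatment coefficient is zero (q = 1). The sketch relies on two facts that hold in this saturated setting: the unrestricted maximized log-likelihood is the sum of the group-wise Bernoulli log-likelihoods evaluated at the group frequencies, and the restricted one uses the pooled frequency. The p-value uses the identity Pr(χ²(1) > x) = erfc(sqrt(x/2)), which is valid for one degree of freedom only.

```python
import math


def bernoulli_log_likelihood(n_ones, n_zeros):
    """Maximized log-likelihood of a Bernoulli sample, evaluated at p = frequency of 1's."""
    p = n_ones / (n_ones + n_zeros)
    return n_ones * math.log(p) + n_zeros * math.log(1 - p)


# Unrestricted model: separate employment probabilities for T = 0 and T = 1.
log_l_ur = bernoulli_log_likelihood(150, 150) + bernoulli_log_likelihood(200, 100)

# Restricted model (treatment coefficient set to zero): one pooled probability.
log_l_r = bernoulli_log_likelihood(350, 250)

lr = 2 * (log_l_ur - log_l_r)

# Survival function of a chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr / 2))

print(f"log L_ur = {log_l_ur:.3f}, log L_r = {log_l_r:.3f}")
print(f"LR = {lr:.3f}, p-value = {p_value:.6f}")
```

The statistic is far above 3.84, the 5% critical value of a χ²(1), so the restriction of no treatment effect is rejected on these data.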

