Multiple Choice Models
Laura Magazzini
University of Verona
[email protected]://dse.univr.it/magazzini
Laura Magazzini (@univr.it) Multiple Choice Models 1 / 72
Introduction
Qualitative Response Models
The general class of models we shall consider are those for which the dependent variable takes values 0, 1, 2, ...
In a few cases, the values will themselves be meaningful, for example
. Number of patents: y = 0, 1, 2, ... (count data)
In most cases, the value of the dependent variable is merely a coding for some qualitative outcome:
. Labor force participation: we code “yes” as 1 and “no” as 0 (qualitative choices)
. Occupational field: let 0 be clerk, 1 engineer, 2 lawyer and so on (categories)
. Opinions are usually coded over Likert scales where 1 stands for “strongly disagree”, 2 “disagree”, 3 “neutral”, 4 “agree”, 5 “strongly agree” (rankings)
Binary Choice Model: Example 1
Performance in macroeconomic test (Ex. 21.3 in Greene, from Spector and Mazzeo, 1980)
Sample of 32 students
. 14 subject to a new method of teaching economics, the Personalized System of Instruction (PSI)
Dependent variable, GRADE = 1 if the student’s grade in an intermediate macroeconomics course was higher than that in the principles course

          PSI
GRADE     0        1        total
0         15       6        21
%         83.33    42.86    65.63
1         3        8        11
%         16.67    57.14    34.38
total     18       14       32
Binary Choice Model: Example 2
Labor participation of married women
Sample of 753 married women
Dependent variable = 1 if the woman worked at least one hour in 1975 (INLF75)

          # of kids < 6 years
INLF75    0        1        2        3        total
0         231      72       19       3        325
%         38.12    61.02    73.08    100      43.16
1         375      46       7        0        428
%         61.88    38.98    26.92    0.00     56.84
total     606      118      26       3        753
General Framework
Regression analysis is not readily applicable to these situations
We can construct models that link the decision or outcome to a set of factors, at least in the spirit of the regression
Response probability
Pr(event j occurs|x) = Pr(Y = j |x) = F (x, β)
. x: relevant effects
. β: parameters
Choice set
Discrete choice models describe decision makers’ choices among alternatives
. The decision makers can be people, households, firms, ...
. The alternatives might represent competing products, courses of action, ...
The choice set must have the following characteristics:
(1) mutually exclusive alternatives
(2) exhaustive
(3) finite number of alternatives
Binary Choice Models
The dependent variable can take two values, usually coded as 0/1, indicating whether or not a certain event has occurred
. joined the labor force, effective drug, bankrupt company
Regardless of the definition of y, it is traditional to refer to y = 1 as success and y = 0 as failure
3 approaches
1 The regression approach
2 Latent variables
3 Random Utility Models
Binary Choice Models The regression approach
The Regression Approach
In the linear regression model you have:
$$E[y_i | X_i] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} = x_i'\beta$$
$$y_i = x_i'\beta + \varepsilon_i$$
where $\beta$ is a vector of unknown parameters and $\varepsilon_i$ has mean zero and is homoskedastic
In the case of a binary dependent variable:
$$E[y|X] = 0 \times \Pr(y = 0|X) + 1 \times \Pr(y = 1|X) = \Pr(y = 1|X)$$
We model: $\Pr(y = 1|X) = F(x'\beta)$
. Different choices for $F$ lead to different models
. The set of parameters $\beta$ is linked to the impact of changes in $x$ on the probability of success
The Linear Probability Model (LPM)
$$\Pr(y_i = 1|X) = x_i'\beta$$
$$y_i = x_i'\beta + \varepsilon_i$$
Interpretation of coefficients does not change with respect to the linear model, with the specification that $E(y|X) = \Pr(y = 1|X)$.
If $x_j$ is not functionally related to the other explanatory variables, $\beta_j$ measures the effect of the explanatory variable $x_j$ on the probability of success:
$$\beta_j = \frac{\partial \Pr(y = 1|X)}{\partial x_j}$$
In the linear model, the $x_j$ can be functions of underlying variables (e.g. log, squared...): this simply changes the interpretation of $\beta_j$
However, the LPM has a number of shortcomings...
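As a rough illustration, the LPM can be fit by OLS to the GRADE by PSI cross-tabulation of Example 1, using PSI as the only regressor. This is a simplification made for the sketch (the textbook example uses more regressors), and the helper `ols_simple` is a hypothetical name introduced here. With a single dummy regressor, the intercept reproduces the success share in the PSI = 0 group and the slope is the difference in success shares.

```python
# Rebuild the 32 observations from the GRADE x PSI cross-tabulation.
# Each tuple is (psi, grade, count).
cells = [(0, 0, 15), (0, 1, 3), (1, 0, 6), (1, 1, 8)]
psi = [p for p, g, n in cells for _ in range(n)]
grade = [g for p, g, n in cells for _ in range(n)]

def ols_simple(x, y):
    """OLS intercept and slope for a single regressor (closed form)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

b0, b1 = ols_simple(psi, grade)
print(f"estimated Pr(GRADE=1 | PSI=0) = {b0:.4f}")
print(f"estimated effect of PSI       = {b1:.4f}")
```

Because the regressor is binary, the fitted values stay inside the unit interval here; that is a feature of this special case, not of the LPM in general.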
LPM: heterogeneity
$\varepsilon$ is heteroskedastic
. If $y_i = 0$, $\varepsilon_i = -x_i'\beta$ with probability $\Pr(Y_i = 0|x_i) = 1 - x_i'\beta$
. If $y_i = 1$, $\varepsilon_i = 1 - x_i'\beta$ with probability $\Pr(Y_i = 1|x_i) = x_i'\beta$
. Therefore $E(\varepsilon_i|x_i) = -x_i'\beta \times (1 - x_i'\beta) + (1 - x_i'\beta) \times x_i'\beta = 0$
. But $Var(\varepsilon_i|x_i) = E(\varepsilon_i^2|x_i) - E(\varepsilon_i|x_i)^2 = ... = x_i'\beta \times (1 - x_i'\beta)$, i.e. $Var(\varepsilon_i|x_i)$ depends on $x_i$
. So, heteroskedasticity is present unless all the slope coefficients are zero
. The problem can be solved by using heteroskedasticity-robust standard errors and t-tests (or by using GLS)
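The step hidden behind the dots in the variance can be written out from the two-point distribution of the error term. The shorthand $p_i = x_i'\beta$ is introduced here only for readability:

```latex
\begin{aligned}
E(\varepsilon_i^2 \mid x_i)
  &= (-p_i)^2 (1 - p_i) + (1 - p_i)^2 p_i \\
  &= p_i (1 - p_i)\,[\,p_i + (1 - p_i)\,] \\
  &= p_i (1 - p_i), \\
Var(\varepsilon_i \mid x_i)
  &= E(\varepsilon_i^2 \mid x_i) - 0^2 = p_i (1 - p_i) = x_i'\beta\,(1 - x_i'\beta).
\end{aligned}
```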
LPM: predicted values as probabilities?
$$\Pr(y_i = 1|X) = x_i'\beta$$
$$y_i = x_i'\beta + \varepsilon_i$$
The most serious flaw is that we cannot be sure that the predictions $\hat{y}_i = x_i'\hat\beta$ lie in the 0-1 interval
. $\hat{y}_i$ cannot safely be interpreted as a probability
. The estimated $Var(\varepsilon_i|x_i)$ can assume negative values
. However, the model often seems to provide good estimates of the partial effects on the response probability near the center of the distribution of the independent variables
LPM: ceteris paribus increases
$$\Pr(y_i = 1|X) = x_i'\beta$$
$$y_i = x_i'\beta + \varepsilon_i$$
LPM assumes that probabilities increase linearly with the explanatory variables
Each unit increase in X has the same effect on the probability of Y occurring regardless of the level of X
This implication cannot literally be true, as continually increasing one of the X would eventually drive $\Pr(y = 1|X)$ outside the unit interval
It’s more realistic to assume a smaller effect at the extreme values of X
Requirements on F : Probit & Logit
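$$\lim_{x'\beta \to \infty} F(x'\beta) = 1$$
$$\lim_{x'\beta \to -\infty} F(x'\beta) = 0$$
Probit Model: $F(x'\beta) = \Phi(x'\beta) = \int_{-\infty}^{x'\beta} \phi(\varepsilon)\, d\varepsilon$
where $\Phi$ represents the cumulative distribution function of the $N(0, 1)$, and $\phi$ its density function
Logit Model: $F(x'\beta) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)}$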
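The two link functions can be sketched with the Python standard library alone. The function names `probit_cdf` and `logit_cdf` are illustrative choices, and the evaluation points are arbitrary.

```python
import math

def probit_cdf(z):
    """Standard normal CDF, Phi(z), computed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z):
    """Logistic CDF, exp(z) / (1 + exp(z)), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Both functions rise from 0 to 1 and cross 0.5 at z = 0.
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  probit={probit_cdf(z):.4f}  logit={logit_cdf(z):.4f}")
```

Printing the two columns side by side gives a feel for how the tails of the two distributions differ at the same value of the index.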
Which one to use?
The two distributions tend to give similar results for intermediate values of $x'\beta$
The logistic distribution has heavier tails: compared with the normal, it tends to give higher probabilities to $y = 1$ if $x'\beta$ is extremely small (and higher probabilities to $y = 0$ if $x'\beta$ is extremely large)
It is difficult to justify the choice of one distribution on theoretical grounds
Historically, logit was preferred for computational reasons
Binary Choice Models Latent variable approach
Latent Variable approach
Example: Work/Not work can be seen as the result of an underlying latent variable (unobserved and unobservable) such as the difference between reservation wage and market wage. We only observe whether the individual works or not, not the underlying comparison between reservation and market wages
The latent variable is defined as $y_i^* = x_i'\beta + \varepsilon_i$
We observe:
. $y_i = 1$ if $y_i^* > c$
. $y_i = 0$ if $y_i^* \le c$
Logit and Probit are obtained by different assumptions on $\varepsilon$
If the interest is in $y^*$, $\beta$ can be interpreted as the change in the latent variable corresponding to a change in $X$: rarely the case...
Probit: $\varepsilon \sim N(0, \sigma^2)$
$$P(y = 1|x) = P(y^* > c|x) = P(x'\beta + \varepsilon > c|x) = P(\varepsilon > c - x'\beta|x)$$
First identification problem: let $c = 0$
Innocent normalization if the model contains a constant term
$$P(y = 1|x) = P(\varepsilon > -x'\beta|x) = P(\varepsilon < x'\beta|x) = \Phi\left(\frac{x'\beta}{\sigma}\right)$$
where the second equality is due to the symmetry of $N$
Second identification problem: let $\sigma = 1$
Innocent normalization: there’s no information about $\sigma$ in the data
Summing up
$$y_i^* = x_i'\beta + \varepsilon_i$$
$y_i = 1$ if $y_i^* > 0$
$y_i = 0$ if $y_i^* \le 0$
If $\varepsilon$ is assumed $N(0, 1)$: Probit model
If $\varepsilon$ is assumed standardized logistic (with known variance equal to $\pi^2/3$): Logit model
Binary Choice Models Random Utility Models
Random Utility Models
Main references
Discrete choice models describe the choice of agents among a set of alternatives
. For the moment, we consider choice among 2 alternatives
Any agent can be considered:
. individual (consumer, commuter, household, ...)
. firm
. other...
References:
. Thurstone (1927), Psychological Review
. First economic application by Luce (1959), Individual Choice Behavior
. Some applications:
- Transportation choice: McFadden (1974), Train (2003)
- Occupational choice: Schmidt and Strauss (1975a,b)
Random Utility Models
Structural Approach
If we are willing to assume rational behaviour: the agent chooses the alternative that maximizes his/her utility
. The utility is known to the agent, but not to the econometrician
. You can think of utility as a latent variable
Let us consider the choice among two alternatives (exhaustive and mutually exclusive)
C1: $U_{i1} = x_{i1}'\beta_1 + \varepsilon_{i1}$
C2: $U_{i2} = x_{i2}'\beta_2 + \varepsilon_{i2}$
We observe $y_i = 1$ if $U_{i1} > U_{i2}$, otherwise $y_i = 0$
Different hypotheses on $\varepsilon_i$ lead to different models:
$$\begin{aligned}
\Pr(y = 1) &= \Pr(U_{i1} > U_{i2}) \\
&= \Pr(x_{i1}'\beta_1 + \varepsilon_{i1} > x_{i2}'\beta_2 + \varepsilon_{i2}) \\
&= \Pr(\varepsilon_{i1} - \varepsilon_{i2} > x_{i2}'\beta_2 - x_{i1}'\beta_1)
\end{aligned}$$
If $\varepsilon_{i1}, \varepsilon_{i2} \sim N$: Probit model
If $\varepsilon_{i1}, \varepsilon_{i2} \sim EV1$ (extreme value of first type): Logit model
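A small simulation can make the link between extreme value errors and the logit formula concrete. The sketch below assumes two alternatives with hypothetical systematic utilities `v1` and `v2` (standing in for $x_{i1}'\beta_1$ and $x_{i2}'\beta_2$), draws independent type I extreme value errors by inverting their CDF, and compares the simulated share choosing alternative 1 with the logistic function of `v1 - v2`. The seed and the number of draws are arbitrary.

```python
import math
import random

def gumbel_draw(rng):
    """One draw from the type I extreme value distribution (inverse CDF)."""
    u = rng.random()
    while u == 0.0:          # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_choice_share(v1, v2, n_draws=200_000, seed=0):
    """Share of draws in which alternative 1 yields the higher utility."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        u1 = v1 + gumbel_draw(rng)
        u2 = v2 + gumbel_draw(rng)
        wins += u1 > u2
    return wins / n_draws

v1, v2 = 0.8, 0.3   # hypothetical systematic utilities
simulated = simulate_choice_share(v1, v2)
logit_prob = 1.0 / (1.0 + math.exp(-(v1 - v2)))
print(f"simulated share = {simulated:.4f}, logit formula = {logit_prob:.4f}")
```

The two numbers should agree up to simulation noise, because the difference of two independent type I extreme value variables follows a logistic distribution.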
Binary Choice Models Estimation and Inference in Binary Choice Models
Likelihood function
Estimation of binary choice models is usually based on the method of maximum likelihood (with the exception of the LPM)
$y_1, ..., y_N$: sample of size $N$ from a population with density $f(y|\theta)$
. $\theta$: vector of parameters that fully characterizes the distribution of $y$ in the population
. Inference: we use the observed sample to get information about the value of $\theta$ in the population
The joint density of $N$ independent and identically distributed observations from the data generating process $f(y|\theta)$ is the likelihood function:
$$f(y_1, ..., y_N|\theta) = \prod_{i=1}^{N} f(y_i|\theta) = L(\theta|y)$$
. Note that we write $L(\theta|y)$: the interest is in the parameters and the information about them that is contained in the observed data
Identification
The issue of identification must be resolved before estimation can even be considered
Suppose we had an infinitely large sample: could we uniquely determine the value of $\theta$ from such a sample?
The parameter vector $\theta$ is identified (estimable) if for any other parameter vector $\theta^* \ne \theta$, for some data $y_1, ..., y_N$, $L(\theta^*|y) \ne L(\theta|y)$
Example: recall the case of probit where we set σ = 1
How to employ the likelihood function to estimate θ?
Principle of maximum likelihood: an event occurred because it was most likely to
Example: 10 independent draws from a Bernoulli distribution (0, 1, 1, 0, 0, 1, 0, 0, 0, 1)
. The parameter $\theta$ is the (population) probability of success
The density for each observation is $f(y_i|\theta) = \theta^{y_i}(1 - \theta)^{1 - y_i}$
The joint density, i.e. the likelihood for this sample, is
$$L(\theta|y) = \prod_{i=1}^{N} [\theta^{y_i}(1 - \theta)^{1 - y_i}] = \theta^4(1 - \theta)^6$$
$L(\theta|y)$ is the “probability” of observing this particular sample, assuming that a Bernoulli distribution with as yet unknown parameter $\theta$ generated the data
. The reference to the “probability” of observing the given sample is not exact in a continuous distribution, since a particular sample has probability zero (nonetheless, the principle is the same)
10 Bernoulli draws: 0, 1, 1, 0, 0, 1, 0, 0, 0, 1
The function has a single mode at $\theta = 0.4$, which would be the maximum likelihood estimate (MLE) of $\theta$
Since the log function is monotonically increasing and easier to work with, $\ln L(\theta|y)$ is usually maximized instead
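The maximization can be checked numerically. The following sketch evaluates the Bernoulli log-likelihood of the ten draws on a grid of candidate values (a crude search chosen only for transparency; the grid step of 0.01 is an arbitrary choice) and compares the maximizer with the sample share of successes.

```python
import math

draws = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1]

def log_likelihood(theta, y):
    """Bernoulli log-likelihood: sum of y*ln(theta) + (1-y)*ln(1-theta)."""
    return sum(yi * math.log(theta) + (1 - yi) * math.log(1 - theta) for yi in y)

# Candidate values 0.01, 0.02, ..., 0.99 (the endpoints 0 and 1 are excluded
# because the log-likelihood is not finite there).
grid = [k / 100 for k in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, draws))

print("grid maximizer:", theta_hat)
print("sample share of successes:", sum(draws) / len(draws))
```

In practice the maximizer is found analytically or with an iterative optimizer; the grid is used here only to make the shape of the problem visible.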
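Properties of MLE $\hat\theta$
Assuming that regularity conditions hold
Consistency: $\text{plim}\, \hat\theta = \theta_0$
Asymptotic normality: $\hat\theta \xrightarrow{d} N(\theta_0, I(\theta_0)^{-1})$, where
$$I(\theta_0) = -E_0\left[\frac{\partial^2 \ln L}{\partial\theta_0\, \partial\theta_0'}\right]$$
Asymptotic efficiency: $\hat\theta$ achieves the Cramér-Rao lower bound
Invariance: The MLE of $\gamma_0 = c(\theta_0)$ is $c(\hat\theta)$ if $c(\theta_0)$ is a continuous and continuously differentiable function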
Probit & Logit: Likelihood Function
Estimation of binary choice models is usually based on the method of maximum likelihood (with the exception of the linear probability model)
Each observation is treated as a single draw from a Bernoulli distribution with probability of success $\Pr(y_i = 1) = F(x_i'\beta) = F_i$
In case of independent observations, the likelihood function can be written as:
$$L(\beta|y) = \Pr(Y_1 = y_1, ..., Y_n = y_n) = \prod_{y_i = 1} F_i \times \prod_{y_i = 0} (1 - F_i) = \prod_{i=1}^{n} F_i^{y_i}(1 - F_i)^{1 - y_i}$$
Taking logs: $\ln L(\beta|y) = \sum_{i=1}^{n} [y_i \ln F_i + (1 - y_i)\ln(1 - F_i)]$
Maximum Likelihood Estimation
$$\ln L(\beta|y) = \sum_{i=1}^{n} [y_i \ln F_i + (1 - y_i)\ln(1 - F_i)]$$
First order conditions: system of $k$ equations in $k$ unknowns
$$\frac{\partial \ln L(\beta|y)}{\partial\beta} = \sum_{i=1}^{n}\left(y_i\frac{f_i}{F_i} + (1 - y_i)\frac{-f_i}{1 - F_i}\right)x_i = 0$$
where $f_i = \partial F_i / \partial(x_i'\beta)$
Nonlinear likelihood equations (except in the case of the linear probability model): iterative solutions needed
From general ML theory: $\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, A^{-1})$, where $A = -\lim_{n\to\infty} \frac{1}{n} E\left(\frac{\partial^2 \ln L}{\partial\beta\, \partial\beta'}\right)$
. In most cases the inverse exists, and when it does, $A$ is positive definite
. If the matrix is not invertible, then perfect collinearity probably exists among the regressors
Maximum Likelihood Estimation (1)
Probit and Logit
In the case of probit and logit, the log-likelihood function is globally concave (the Hessian is always negative definite)
As a consequence, a unique solution exists to the maximization problem
In the case of logit, the FOC can be written as:
$$\frac{\partial \ln L}{\partial\beta} = \sum_{i=1}^{n}(y_i - F_i)x_i = 0$$
. Orthogonality condition between generalized residuals and independent variables
. If the model contains the constant term:
$$\sum_{i=1}^{n}(y_i - F_i) = 0 \Leftrightarrow \sum_{i=1}^{n}\left(\frac{y_i}{n} - \frac{F_i}{n}\right) = 0 \Leftrightarrow \sum_{i=1}^{n}\frac{y_i}{n} = \sum_{i=1}^{n}\frac{F_i}{n}$$
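The iterative solution of the logit first order conditions can be sketched with a hand-written Newton-Raphson loop. The example below is only an illustration under simplifying assumptions: it uses the GRADE by PSI cross-tabulation of Example 1 with a constant and the PSI dummy as the only regressors, a fixed number of iterations instead of a convergence test, and hypothetical helper names (`logistic`, `fit_logit`).

```python
import math

# (psi, grade, count) cells of the GRADE x PSI table.
cells = [(0, 0, 15), (0, 1, 3), (1, 0, 6), (1, 1, 8)]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(cells, n_iter=25):
    """Newton-Raphson for a logit with a constant and one dummy regressor."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # score: sum of (y - F) * x
        h00 = h01 = h11 = 0.0    # minus the Hessian: sum of F(1 - F) * x x'
        for x, y, n in cells:
            F = logistic(b0 + b1 * x)
            w = F * (1.0 - F)
            g0 += n * (y - F)
            g1 += n * (y - F) * x
            h00 += n * w
            h01 += n * w * x
            h11 += n * w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta_new = beta + (minus Hessian)^{-1} * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logit(cells)
n_obs = sum(n for _, _, n in cells)
mean_y = sum(n * y for _, y, n in cells) / n_obs
mean_F = sum(n * logistic(b0 + b1 * x) for x, _, n in cells) / n_obs
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"mean of y = {mean_y:.4f}, mean of fitted F = {mean_F:.4f}")
```

At the solution the average fitted probability matches the sample share of successes, which is the implication of the first order condition for the constant term.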
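Maximum Likelihood Estimation (2)
Probit and Logit
In the case of probit:
$$\frac{\partial \ln L}{\partial\beta} = \sum_{i=1}^{n}\lambda_i x_i = 0$$
where
$$\lambda_i = \begin{cases} \dfrac{\phi(x_i'\beta)}{\Phi(x_i'\beta)} & \text{if } y_i = 1 \\[2ex] \dfrac{-\phi(x_i'\beta)}{1 - \Phi(x_i'\beta)} & \text{if } y_i = 0 \end{cases}$$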
Robust estimation?
Probit model
A number of applications based on the probit model make use of White’s robust “sandwich” estimator
If the model is correctly specified (i.e. homoskedasticity holds), applying the “sandwich” estimator does not change the estimate of the variance-covariance matrix
The presence of heteroskedasticity entirely changes the functional form for $\Pr(y = 1|X)$: if $\sigma_i^2 \ne \sigma^2$ then $\Pr(y = 1|X) \ne \Phi(x'\beta)$
As a result probit will be inconsistent for $\beta$ when the error term is heteroskedastic!!!
Marginal Effects
Recall the linear regression model: $y_i = x_i'\beta + \varepsilon_i$
. $\beta_j$ represents the change in $y_i$ due to a unit change in $x_{ij}$, ceteris paribus: $\beta_j = \frac{\partial E(y_i|x_i)}{\partial x_{ij}}$
In the logit and probit models: $E(y_i|x_i) = \Pr(y_i = 1|x_i)$ represents the probability of success
$\frac{\partial E(y_i|x_i)}{\partial x_{ij}}$ is called marginal effect and represents the change in the probability of success due to a unit change in the independent variable $x_j$
$$\frac{\partial E(y_i|x_i)}{\partial x_{ij}} = \frac{\partial \Pr(y_i = 1|x_i)}{\partial x_{ij}} = f(x_i'\beta)\beta_j$$
where $f(z) = \partial F / \partial z$
Marginal Effects: Probit & Logit
Probit: $\frac{\partial E(y_i|x_i)}{\partial x_{ij}} = \phi(x_i'\beta)\beta_j$
Logit: $\frac{\partial E(y_i|x_i)}{\partial x_{ij}} = F_i(1 - F_i)\beta_j$
The $\beta$ coefficients do not have a direct interpretation
. Knowing the sign of $\beta_j$ is enough to determine whether the variable $x_j$ has a positive or negative effect
. But to find the magnitude of the effect, we have to take into account the value of all the explanatory variables
Marginal effect of the “average” person vs. average of individual marginal effects
. If you have dummy variable(s) in your regression, then their average does not correspond to any individual: it makes more sense to consider 0 or 1
. To compute the marginal effect for dummy variables, take the difference $\Pr_1(x) - \Pr_0(x)$
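The formulas above translate directly into code. The sketch below uses illustrative function names and hypothetical numbers (an index $x'\beta$ of 0.2 and a coefficient of 0.5); it computes the marginal effect of a continuous regressor for probit and logit and the discrete change for a dummy variable.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def probit_marginal_effect(index, beta_j):
    """phi(x'beta) * beta_j for a continuous regressor."""
    return normal_pdf(index) * beta_j

def logit_marginal_effect(index, beta_j):
    """F(1 - F) * beta_j for a continuous regressor."""
    F = logistic_cdf(index)
    return F * (1.0 - F) * beta_j

def dummy_effect(cdf, index_without_dummy, beta_dummy):
    """Pr1(x) - Pr0(x): change in probability when the dummy goes from 0 to 1."""
    return cdf(index_without_dummy + beta_dummy) - cdf(index_without_dummy)

# Hypothetical values, chosen only for illustration.
print(probit_marginal_effect(0.2, 0.5))
print(logit_marginal_effect(0.2, 0.5))
print(dummy_effect(logistic_cdf, 0.2, 0.5))
print(dummy_effect(normal_cdf, 0.2, 0.5))
```

To obtain the average of individual marginal effects, the same functions would be evaluated at each observation's index and averaged, rather than evaluated once at the sample means.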
Marginal Effects: Standard errors
The Delta-Method can be employed to compute the standard error of the marginal effect:
Let $\hat\theta$ be an estimate of the parameter of interest
. For convenience let $\theta$ be a scalar
IF $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \sigma^2)$
THEN $\sqrt{n}(g(\hat\theta) - g(\theta_0)) \xrightarrow{d} N(0, g'(\theta_0)^2\sigma^2)$
. In the multiparametric case, the Jacobian of $g$ with respect to $\theta$ is employed
In our case:
Let $\Sigma$ be the (asymptotic) variance-covariance matrix of the coefficients
The (asymptotic) variance-covariance matrix of the marginal effects can be computed as
$$\frac{\partial ME}{\partial\beta'}\, \Sigma \left(\frac{\partial ME}{\partial\beta'}\right)'$$
Testing hypotheses about your coefficients
We consider MLE of a parameter $\theta$ and a test of the hypothesis $c(\theta) = q$
. Likelihood ratio test: compares $L_U$ (value of the likelihood function at the unconstrained estimate of $\theta$) and $L_R$ (value of the likelihood function at the restricted estimate). If the restriction is valid, then imposing it should not lead to a large reduction in the log-likelihood function
. Wald test: If the restriction is valid, then $c(\hat\theta_{MLE})$ should be close to $q$
. Lagrange Multiplier test: The test is based on the slope of the log-likelihood at the point where the function is maximized subject to the restriction. If the restriction is valid, the slope of the log-likelihood function should be near zero at the restricted estimator
The three tests are asymptotically equivalent under the null hypothesis, but can behave rather differently in a small sample
When applied to linear models, $W \ge LR \ge LM$
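Testing hypotheses about your coefficients
H0: $c(\theta) = q$, with $q$ of size $k \times 1$
Likelihood ratio test
$$LR = -2(\ln L_R - \ln L_U) = -2\ln\frac{L_R}{L_U} \sim \chi^2_k$$
Wald test
$$W = [c(\hat\theta) - q]'(\text{Asy.Var}[c(\hat\theta) - q])^{-1}[c(\hat\theta) - q] \sim \chi^2_k$$
Lagrange Multiplier test (or score test)
$$LM = \left(\frac{\partial\ln L(\hat\theta_R)}{\partial\hat\theta_R}\right)'[I(\hat\theta_R)]^{-1}\left(\frac{\partial\ln L(\hat\theta_R)}{\partial\hat\theta_R}\right) \sim \chi^2_k$$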
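As an illustration of the likelihood ratio statistic, the sketch below tests whether PSI matters in the GRADE by PSI cross-tabulation of Example 1. It assumes a model with only a constant and the PSI dummy, in which case the maximized likelihoods can be computed from the cell counts without any iteration: the unrestricted fit reproduces the success share of each group and the restricted fit uses the overall share. The helper name `bernoulli_loglik` is hypothetical.

```python
import math

# Counts from the GRADE x PSI table: (failures, successes) by PSI group.
groups = {0: (15, 3), 1: (6, 8)}

def bernoulli_loglik(n_fail, n_succ):
    """Maximized Bernoulli log-likelihood given counts (MLE = sample share)."""
    n = n_fail + n_succ
    p = n_succ / n
    return n_succ * math.log(p) + n_fail * math.log(1.0 - p)

# Unrestricted: a separate success probability for each PSI group.
lnL_U = sum(bernoulli_loglik(f, s) for f, s in groups.values())
# Restricted (PSI coefficient equal to zero): one common success probability.
lnL_R = bernoulli_loglik(sum(f for f, _ in groups.values()),
                         sum(s for _, s in groups.values()))

LR = -2.0 * (lnL_R - lnL_U)
# With one restriction, the chi-squared(1) upper tail equals erfc(sqrt(LR / 2)).
p_value = math.erfc(math.sqrt(LR / 2.0))
print(f"ln L_U = {lnL_U:.4f}, ln L_R = {lnL_R:.4f}")
print(f"LR = {LR:.3f}, p-value = {p_value:.4f}")
```

The closed form for the p-value works only for a single restriction; with more restrictions a chi-squared distribution function from a statistics library would be needed.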
Three asymptotically equivalent test procedures
H0: $c(\theta) = 0$
Binary Choice Models Goodness of fit
How does your model fit the data?
Compare your model with the null model
H0: $\beta = 0$ can be tested with $-2[\ln L_0 - \ln L] \xrightarrow{d} \chi^2_{k-1}$
pseudo-$R^2 = 1 - \frac{\ln L}{\ln L_0}$
Various other measures have been proposed in the literature, but no standard procedure has been agreed upon yet
In order to compute $L_0$ no maximization is needed: the MLE is obtained from the share of successes in the sample
A note on pseudo-$R^2 = 1 - \frac{\ln L}{\ln L_0}$
pseudo-$R^2 = 0$ when $\frac{\ln L}{\ln L_0} = 1$
. poor fit of the model
In order to have pseudo-$R^2 = 1$: $\ln L = 0 \Leftrightarrow L = 1$
. This can only happen when $F_i = 1$ if $y_i = 1$ and $F_i = 0$ if $y_i = 0$
. BUT $0 < F_i < 1$ (equal to 0 or 1 only in the limit)
. Therefore pseudo-$R^2 < 1$
Share of correctly predicted
A useful summary of the predictive ability of the model is a $2 \times 2$ table of the hits and misses of a prediction rule such as
$\hat{y} = 1$ if $\hat{F} > 0.5$ and 0 otherwise
Do not place too much emphasis on this measure of goodness of fit
. Let $P$ be the observed probability of success (share of 1s in your sample). The naive predictor
$\hat{y} = 1$ if $P > 0.5$ and 0 otherwise
would always predict correctly $100 \times \max(P, 1 - P)$ percent of the observations
. The naive model does not have a zero fit
. Compute the percent correctly predicted for each outcome, $y = 0$ and $y = 1$
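The goodness-of-fit measures can be computed by hand for the GRADE by PSI cross-tabulation of Example 1. The sketch assumes a model with a constant and the PSI dummy only, so that the fitted probabilities equal the success share within each PSI group; the helper names are hypothetical.

```python
import math

# (psi, grade, count) cells of the GRADE x PSI table.
cells = [(0, 0, 15), (0, 1, 3), (1, 0, 6), (1, 1, 8)]
n_obs = sum(n for _, _, n in cells)

def success_share(rows):
    """Share of observations with grade = 1 among the given cells."""
    total = sum(n for _, _, n in rows)
    return sum(n for _, y, n in rows if y == 1) / total

# A model with a constant and the PSI dummy fits each group's success share.
fitted = {g: success_share([c for c in cells if c[0] == g]) for g in (0, 1)}
share = success_share(cells)            # P in the text

def loglik(prob_of_success):
    """Log-likelihood given a function mapping psi to Pr(grade = 1)."""
    return sum(n * math.log(prob_of_success(x) if y == 1 else 1.0 - prob_of_success(x))
               for x, y, n in cells)

pseudo_r2 = 1.0 - loglik(lambda x: fitted[x]) / loglik(lambda x: share)

# Hits of the rule "predict 1 if F > 0.5", split by observed outcome.
hits = {0: 0, 1: 0}
for x, y, n in cells:
    if (1 if fitted[x] > 0.5 else 0) == y:
        hits[y] += n
naive_prediction = 1 if share > 0.5 else 0
naive_hits = sum(n for _, y, n in cells if y == naive_prediction)

print(f"pseudo-R2 = {pseudo_r2:.4f}")
print(f"model hits: {sum(hits.values())}/{n_obs}, by outcome: {hits}")
print(f"naive hits: {naive_hits}/{n_obs}")
```

Comparing the model's hits with the naive predictor's hits, and looking at the hits separately for each outcome, shows why the overall share of correct predictions should be read with caution.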
Example: Labor participation of married women
Example 15.2 from Wooldridge
Specification issues in probit models
Neglected heterogeneity
Omitted variables
. These problems are relevant for all index models
. Since the normal distribution allows us to obtain concrete results, the focus is on probit
In linear models:
. Heterogeneity causes OLS to be inefficient, however it is still consistent
. Omitted variables can lead to inconsistent estimates, unless...
- the omitted variable does not affect y
- the omitted variable is uncorrelated with x
Neglected heterogeneity (1)
The (structural) model of interest is: $\Pr(y = 1|X, c) = \Phi(x'\beta + \gamma c)$
. In a latent variable form: $y^* = x'\beta + \gamma c + \varepsilon$, where $y = 1[y^* > 0]$
. The interest is in the partial effect of the $x_j$ on the probability of success, holding $c$ and the other elements of $X$ constant
Assume $c$ independent of $X$ and $c \sim N(0, \tau^2)$
Then, $\gamma c + \varepsilon$ is independent of $X$ and has a normal distribution with mean zero and variance $\sigma^2 = \gamma^2\tau^2 + 1$
$$\Pr(y = 1|X) = \Pr(\gamma c + \varepsilon > -x'\beta|X) = \Phi(x'\beta/\sigma)$$
Probit of $y$ on $X$ consistently estimates $\beta/\sigma$
As $\sigma = (\gamma^2\tau^2 + 1)^{1/2} > 1$, estimates suffer from attenuation bias: $|\beta_j/\sigma| < |\beta_j|$
Neglected heterogeneity (2)
However, interest is in
$$\partial\Pr(y = 1|X, c)/\partial x_j = \beta_j\phi(x'\beta + \gamma c)$$
What we can consistently estimate is $(\beta_j/\sigma)\phi(x'\beta/\sigma)$
. Different from $\partial\Pr(y = 1|X, c)/\partial x_j$ evaluated at $c = 0$ (not really informative...)
. Equal to the average partial effect:
$$E_c[\beta_j\phi(x'\beta + \gamma c)] = \frac{\beta_j}{\sigma}\phi\left(\frac{x'\beta}{\sigma}\right)$$
Omitted heterogeneity is not a problem in probit when it is independent of $X$ (average partial effects can be consistently estimated)
However, if $c$ is correlated with $X$ (or otherwise dependent on $X$), then the omission of $c$ is serious: we cannot get consistent estimates of APE
Continuous endogenous explanatory variable
One possibility: 2SLS applied to LPM
To apply probit estimation, stronger assumptions are needed
In the following model assume normally distributed error terms:
$$y_1^* = z_1\delta_1 + \alpha_1 y_2 + u_1, \text{ where } y_1 = 1[y_1^* > 0]$$
$$y_2 = z_1\delta_{21} + z_2\delta_{22} + u_2 = z\delta_2 + u_2$$
If $u_1$ and $u_2$ are independent, no endogeneity problem exists
Conversely, if $y_2$ is correlated with $u_1$, APE can still be estimated
. Normalization of the variance of $u_1$ is needed ($\sigma^2_{u_1} = 1$)
Under the assumption of joint normality, full maximum likelihood can be employed for estimation
Continuous endogenous explanatory variable
Two-step approach (Rivers and Vuong, 1988)
As $(u_1, u_2)$ are jointly normal and correlation exists between the two terms, we can write:
$$u_1 = \theta_1 u_2 + e_1 = \frac{cov(u_1, u_2)}{var(u_2)}u_2 + e_1$$
Because of joint normality of $(u_1, u_2)$, $e_1$ is also normal with mean zero and variance $Var(u_1) - \theta_1^2 Var(u_2) = 1 - \theta_1^2 Var(u_2)$ (which equals $1 - \theta_1^2$ when $Var(u_2) = 1$)
If $u_2$ were observable, APE could be estimated from the model:
$$y_1^* = z_1\delta_1 + \alpha_1 y_2 + u_1 = z_1\delta_1 + \alpha_1 y_2 + \theta_1 u_2 + e_1$$
Two-step approach
(a) Run OLS of $y_2$ on $z$: you get a consistent estimate of the residuals $\hat{u}_2$
(b) Run probit of $y_1$ on $z_1$, $y_2$, and $\hat{u}_2$
Test of exogeneity, H0: $\theta_1 = 0$
Binary endogenous variable
y1 = 1[z1δ1+α1y2+u1>0]
y2 = 1[z2δ1+u2>0]
The average treatment effect is Φ(z1δ1 + α1)− Φ(z1δ1)
Under the normality assumption, the joint distribution of $(y_1, y_2)$ is specified: maximum likelihood can be applied for estimation
Rivers-Vuong approach can be employed to test for endogeneity
Alternatively a score test for H0: $cov(u_1, u_2) = 0$ can be applied (it does not require estimation of the full model)
Categorical variable models
Categorical data
Y is the result of a single decision among more than 2 alternatives
Unordered choice set: Categories/Qualitative choices
. multinomial logit, conditional logit, nested logit
Ordered choice set (rankings): models for ordered data
. ordered probit
Categorical variable models Models for categories/qualitative choices
Example: Education and Occupational Choice
               Education
Occupation     Primary/Secondary School   University or more   Total
Menial         23 (74.19%)                8 (25.81%)           31 (100%)
Blue Collar    60 (86.96%)                9 (13.04%)           69 (100%)
Craft          65 (77.38%)                19 (22.62%)          84 (100%)
WhiteCol       27 (65.85%)                14 (34.15%)          41 (100%)
Prof           27 (24.11%)                85 (75.89%)          112 (100%)
Total          202 (59.94%)               135 (40.06%)         337 (100%)
Multinomial distribution
Yi : qualitative random variable with J categories
$P_{ij} = \Pr(Y_i = j)$, $j = 1, 2, ..., J$
. Probability that individual $i$ will choose alternative $j$
Categories are mutually exclusive and exhaustive: $\sum_j P_{ij} = 1$, $i = 1, 2, ..., N$
Let $d_i = (d_{i1}, d_{i2}, ..., d_{iJ})$, where $d_{ij} = 1$ if $Y_i = j$ (and 0 otherwise)
. $\sum_j d_{ij} = 1$, $i = 1, 2, ..., N$
Multinomial logit model (MNL)
Y : result of a choice among J alternatives (J > 2)
$d_i = (d_{i1}, d_{i2}, ..., d_{iJ})$, where $d_{ij} = 1$ if $Y_i = j$
$P_{ij} = \Pr(Y_i = j)$, $\sum_j P_{ij} = 1$
Logit model:
$$\Pr(Y_i = j) = \frac{\exp(\eta_{ij})}{\sum_{l=1}^{J}\exp(\eta_{il})}$$
Properties of MNL
$0 \le P_{ij} \le 1$
$\sum_j P_{ij} = 1$ (by definition)
For every pair of alternatives $(k, l)$, the probability ratio is
$$\frac{P_{ik}}{P_{il}} = \frac{\exp(\eta_{ik})}{\exp(\eta_{il})} \Rightarrow \log\frac{P_{ik}}{P_{il}} = \eta_{ik} - \eta_{il}$$
The model can be motivated by a random utility model
Random Utility Models (1)
McFadden (1973, 2001)
J alternatives: mutually exclusive, exhaustive, finite set
. Examples: competing brands, different means of transport, different occupations, ...
Categories can be ordered or unordered
. Different techniques will be employed according to the nature of the alternatives
. Assume non-ordered alternatives
Rational agent chooses the alternative that maximizes his/her utility: $Y_i = j$ if $U_{ij} > U_{ik}$ for each $k \ne j$
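Random Utility Models (2)
McFadden (1973, 2001)
Linear utility model: $U_{ij} = \eta_{ij} + \varepsilon_{ij}$ with $\eta_{ij} = LC(z_{ij}, \theta)$
. $\eta_{ij}$ links the agent utility to factors that can be observed
. $\eta_{ij}$ is different from $U_{ij}$ since there are factors that cannot be observed by the researcher
$$\begin{aligned}
\Pr(Y_i = j) &= \Pr(U_{ij} > U_{ik}, \forall k \ne j) \\
&= \Pr(\eta_{ij} + \varepsilon_{ij} > \eta_{ik} + \varepsilon_{ik}, \forall k \ne j) \\
&= \Pr(\varepsilon_{ik} - \varepsilon_{ij} < \eta_{ij} - \eta_{ik}, \forall k \ne j) \\
&= \int_{\varepsilon} I(\varepsilon_{ik} - \varepsilon_{ij} < \eta_{ij} - \eta_{ik}, \forall k \ne j)\, f(\varepsilon)\, d\varepsilon
\end{aligned}$$
with $f$ the probability density function of $\varepsilon$
The model is made operational by a particular choice of distribution for the disturbance
. Closed functional forms exist only for few specifications (e.g. logit)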
How to specify ηij?
‘Standard’ MNL
. $\eta_{ij} = x_i'\beta_j$
. $x$ individual characteristics, constant across all the alternatives $j$
Conditional logit model
. $\eta_{ij} = z_{ij}'\gamma$
. $z_{ij}$ characteristics of the choice $j$ and individual $i$
- Datasets typically analyzed by economists do not contain mixtures of individual and choice-specific attributes
- CLM is usually applied when the interest is in the effect of choice-specific attributes
- Custom transformation is needed for variables containing individual-specific attributes
‘Standard’ MNL
$$\Pr(Y_i = j|x_i) = \frac{\exp(x_i'\beta_j)}{\sum_{l=1}^{J}\exp(x_i'\beta_l)}$$
It is not possible to estimate all the $\beta_1, ..., \beta_J$
By adding the same constant vector to all the $\beta$s, the probability doesn’t change
Indeterminacy in the model is removed by letting $\beta_1 = 0$
$j = 1$ is the reference category
$$\Pr(Y_i = j|x_i) = \frac{\exp(x_i'\beta_j)}{1 + \sum_{l=2}^{J}\exp(x_i'\beta_l)}$$
An intercept is allowed in the model by letting the first element of $x_i$ equal 1 for every $i$
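The indeterminacy can be verified numerically. In the sketch below the regressor vector and the coefficient vectors are hypothetical numbers chosen for illustration; subtracting $\beta_1$ from every $\beta_j$ sets the reference category's coefficients to zero and leaves all choice probabilities unchanged.

```python
import math

def mnl_probabilities(x, betas):
    """Choice probabilities exp(x'b_j) / sum_l exp(x'b_l) for one individual."""
    eta = [sum(xk * bk for xk, bk in zip(x, b)) for b in betas]
    eta_max = max(eta)                      # subtract the max for stability
    expo = [math.exp(e - eta_max) for e in eta]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical example: constant plus one characteristic, J = 3 alternatives.
x = [1.0, 2.0]
betas = [[0.3, -0.1], [0.8, 0.2], [-0.5, 0.4]]

# Shift every beta_j by the same vector, chosen so that beta_1 becomes zero.
normalized = [[bk - b1k for bk, b1k in zip(b, betas[0])] for b in betas]

p_original = mnl_probabilities(x, betas)
p_normalized = mnl_probabilities(x, normalized)
print("original:  ", [round(p, 4) for p in p_original])
print("normalized:", [round(p, 4) for p in p_normalized])
```

Subtracting the largest linear predictor before exponentiating is itself an application of the same invariance, used here only to keep the exponentials numerically well behaved.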
Estimation: MLE
The log likelihood can be written as
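$$\ln L = \sum_{i=1}^{n}\sum_{j=1}^{J} d_{ij}\ln\Pr(Y_i = j)$$
. with $d_{ij} = 1$ if $Y_i = j$, 0 otherwise
The derivatives have the characteristically simple form:
$$\frac{\partial\ln L}{\partial\beta_j} = \sum_i (d_{ij} - P_{ij})x_i = 0$$
As a consequence, if the model is estimated with an intercept, $\sum_i d_{ij} = \sum_i P_{ij}$ for every $j$: the predicted frequencies equal the sample frequencies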
Interpretation of the parameters
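The partial effects for this model are complicated:
$$\frac{\partial P_j}{\partial x_i} = P_j\left[\beta_j - \sum_{k=1}^{J}P_k\beta_k\right] = P_j[\beta_j - \bar\beta]$$
. The coefficients in this model are difficult to interpret: $\partial P_j/\partial x_k$ need not have the same sign as $\beta_{jk}$
A simpler interpretation by considering the odds ratio:
. $\ln\frac{P_{ij}}{P_{i1}} = x_i'\beta_j$
. $\ln\frac{P_{ij}}{P_{ik}} = x_i'(\beta_j - \beta_k)$ if $k \ne 1$
In case of dummy variables (coded as 0 or 1)
. $\ln\frac{P_{ij}(x_i = 1)}{P_{i1}(x_i = 1)} = \beta_j$
. $\ln\frac{P_{ij}(x_i = 1)}{P_{ik}(x_i = 1)} = \beta_j - \beta_k$ if $k \ne 1$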
Conditional logit model
$$\Pr(Y_i = j|z_j) = \frac{\exp(z_j\beta)}{\sum_{k=1}^{J}\exp(z_k\beta)}$$
The model contains choice-specific attributes
The coefficients of individual-specific attributes (that do not vary across categories) are not identified
. Individual-specific variables can be inserted in the model, but need to be properly transformed
Not all the coefficients of the choice-specific attributes can be separately identified: adding a constant to all the coefficients does not change the estimated probability
. The intercept is set to zero
Marginal effects
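$$\frac{\partial P_j(z)}{\partial z_k} = \beta_k[P_j(z)(I_{(j=k)} - P_k(z))]$$
$\frac{\partial P_j(z)}{\partial z_j} = \beta_z[P_j(z)(1 - P_j(z))]$
$\frac{\partial P_j(z)}{\partial z_h} = -\beta_z P_j(z)P_h(z)$ $(j \ne h)$
$P_j$ changes monotonically with respect to $z$
The sign of the derivative depends on the sign of $\beta_z$
Opposite effect by considering $z_j$ or $z_h$
Symmetry: $\frac{\partial P_j}{\partial z_h} = \frac{\partial P_h}{\partial z_j}$
$P_j$ does not change if all the variables $z_{kh}$ change by the same amount (the ranking of $U_{ij}$ is unchanged!)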
Multinomial logit (MNL) vs conditional logit (CNL)
Similar response probabilities, but they differ in some important respects
MNL: the conditioning variables do not change across alternatives
. Characteristics of the alternatives are unimportant or not of interest, or data are not available
. Example: occupational choice, where we do not know how much someone could make in every occupation
. We can collect data on factors affecting individual productivity and tastes, e.g. education, past experience
. MNL: factors can have different effects on relative probabilities (different $\beta_j$ for different choices)
CNL: choices on the basis of observable attributes of each alternative
. Common $\beta$
MNL as a special case of CNL
Important limitation: independence from irrelevant alternatives assumption
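One way to see that MNL is a special case of CNL is to interact the individual-specific regressors with alternative dummies and stack the $\beta_j$ into a single common vector. The sketch below illustrates this construction in plain Python; the helper names and the numbers are hypothetical.

```python
import math

def softmax(eta):
    m = max(eta)
    w = [math.exp(e - m) for e in eta]
    s = sum(w)
    return [wi / s for wi in w]

def mnl_probs(x, betas):
    """MNL: same x for every alternative, one beta_j per alternative."""
    return softmax([sum(b * v for b, v in zip(beta_j, x)) for beta_j in betas])

def cl_probs(z, beta):
    """CNL: one attribute vector z_j per alternative, common beta."""
    return softmax([sum(b * v for b, v in zip(beta, z_j)) for z_j in z])

def interact_with_alternatives(x, n_alternatives):
    """Build z_j with x in block j and zeros elsewhere (x times alternative dummies)."""
    K = len(x)
    z = []
    for j in range(n_alternatives):
        z_j = [0.0] * (K * n_alternatives)
        z_j[j * K:(j + 1) * K] = x
        z.append(z_j)
    return z

# Hypothetical individual: intercept, education, experience.
x = [1.0, 12.0, 5.0]
betas = [[0.0, 0.0, 0.0], [-1.0, 0.10, 0.02], [-3.0, 0.25, -0.01]]

stacked_beta = [b for beta_j in betas for b in beta_j]
p_mnl = mnl_probs(x, betas)
p_cl = cl_probs(interact_with_alternatives(x, len(betas)), stacked_beta)
```

The two probability vectors coincide because $z_j'\beta = x'\beta_j$ by construction.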
Independence from irrelevant alternatives (logit)
For every pair of alternatives $(k, l)$, the probability ratio (odds) is
$$\omega = \frac{\Pr(Y_i = k \mid x_{ik})}{\Pr(Y_i = l \mid x_{il})} = \frac{\exp(\eta_{ik})}{\exp(\eta_{il})}$$
$\omega$ depends only on the linear predictors ($\eta$) of the considered alternatives, not on the whole set of alternatives
From the point of view of estimation, it is useful that the odds ratio does not depend on the other choices
But it is not a particularly appealing restriction to place on consumer behaviour
IIA: example by McFadden (1984)
Commuters initially choose between cars and red buses with equal probabilities
Suppose a third mode (blue buses) is added and commuters do not care about the colour of the bus (i.e. they will choose between the two buses with equal probability)
IIA implies that the fraction of commuters taking a car would fall from 1/2 to 1/3, a result that is not very realistic (one would expect the car share to stay at 1/2, with the two buses splitting the other half)
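The arithmetic of the example can be reproduced with a few lines of Python. This is only an illustrative sketch: equal choice probabilities are represented by setting every linear predictor to zero.

```python
import math

def logit_probs(eta):
    """Logit choice probabilities from a list of linear predictors."""
    w = [math.exp(e) for e in eta]
    s = sum(w)
    return [wi / s for wi in w]

# Equal linear predictors: commuters are indifferent between the modes.
p_before = logit_probs([0.0, 0.0])          # car, red bus
p_after = logit_probs([0.0, 0.0, 0.0])      # car, red bus, blue bus

# Odds of car versus red bus, before and after the blue bus is added.
odds_before = p_before[0] / p_before[1]
odds_after = p_after[0] / p_after[1]
```

The car share drops from 1/2 to 1/3 precisely because the logit keeps the car to red bus odds fixed at 1 when the blue bus enters the choice set.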
Testing IIA
Hausman and McFadden (1984)
If a subset of the choice set is truly irrelevant, omitting it from the model altogether will not change the parameter estimates systematically
Exclusion of these choices will be inefficient but will not lead to inconsistency
But if the remaining odds are not truly independent from these alternatives, then the parameter estimates obtained when these choices are included will be inconsistent
Therefore, Hausman’s specification test can be applied
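Hausman's specification test
Consider two different estimators $\hat{\theta}_E$ and $\hat{\theta}_I$
Under $H_0$, $\hat{\theta}_E$ and $\hat{\theta}_I$ are both consistent and $\hat{\theta}_E$ is efficient relative to $\hat{\theta}_I$
Under $H_1$, $\hat{\theta}_I$ remains consistent while $\hat{\theta}_E$ is inconsistent
Then $H_0$ can be tested by using the Hausman statistic:
$$\begin{aligned} H &= (\hat{\theta}_I - \hat{\theta}_E)'\,[\text{Est.Asy.Var}(\hat{\theta}_I - \hat{\theta}_E)]^{-1}\,(\hat{\theta}_I - \hat{\theta}_E) \\ &= (\hat{\theta}_I - \hat{\theta}_E)'\,[\text{Est.Asy.Var}(\hat{\theta}_I) - \text{Est.Asy.Var}(\hat{\theta}_E)]^{-1}\,(\hat{\theta}_I - \hat{\theta}_E) \xrightarrow{d} \chi^2_J \end{aligned}$$
The appropriate degrees of freedom for the test will depend on the context
In the case of MNL, $J$ is the number of parameters in the estimating equation of the restricted choice set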
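As a rough sketch of how the statistic is computed once the two sets of estimates are available, the following plain-Python code evaluates the quadratic form. All numbers are made up for illustration and do not come from any data set; the small `solve` helper is a hypothetical stand-in for a linear algebra library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def hausman_statistic(theta_I, theta_E, var_I, var_E):
    """H = d' [Var(theta_I) - Var(theta_E)]^{-1} d, with d = theta_I - theta_E."""
    d = [a - b for a, b in zip(theta_I, theta_E)]
    V = [[vi - ve for vi, ve in zip(row_i, row_e)] for row_i, row_e in zip(var_I, var_E)]
    return sum(di * wi for di, wi in zip(d, solve(V, d)))

# Made-up estimates: E = full choice set (efficient under H0),
# I = restricted choice set (inefficient, larger variance).
theta_E = [0.50, -1.20]
theta_I = [0.62, -1.05]
var_E = [[0.010, 0.002], [0.002, 0.020]]
var_I = [[0.030, 0.004], [0.004, 0.050]]

H = hausman_statistic(theta_I, theta_E, var_I, var_E)
```

Here `H` is compared with a $\chi^2$ distribution with 2 degrees of freedom, whose 5% critical value is about 5.99. In finite samples the difference of the two estimated variance matrices need not be positive definite, in which case the statistic can be negative and the sketch above gives no usable answer.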
What if the IIA hypothesis is not satisfied?
(1) Multinomial probit model (MNP)
$$U_j = \beta' x_j + \varepsilon_j, \quad j = 1, \dots, J, \qquad [\varepsilon_1, \varepsilon_2, \dots, \varepsilon_J] \sim N(0, \Sigma)$$
$$\Pr(Y_i = j) = \Pr(U_j > U_k \ \ \forall k \neq j), \quad j = 1, 2, \dots, J$$
Main obstacle: difficulty in computing the multivariate normal probability for any dimensionality higher than 2
Recent advances in accurate simulation of multinormal integrals have made estimation of MNP more feasible
. Simulation-based estimation
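To give a flavour of simulation, the sketch below approximates a choice probability by drawing the errors and counting how often an alternative has the highest utility. It is a crude frequency simulator written for illustration only, not the smooth simulators (such as GHK) typically used for estimation. The setting revisits the car, red bus, blue bus example with all systematic utilities set to zero; the correlation value 0.99 is an arbitrary assumption.

```python
import math
import random

def simulate_car_share(rho, n_draws=20000, seed=0):
    """Frequency simulator for Pr(car) in a three-alternative probit.

    All systematic utilities are zero and every error is N(0, 1). The car
    error is independent of the bus errors; the two bus errors have
    correlation rho.
    """
    rng = random.Random(seed)
    car_chosen = 0
    for _ in range(n_draws):
        e_car = rng.gauss(0.0, 1.0)
        e_red = rng.gauss(0.0, 1.0)
        # Build the blue-bus error so that corr(e_red, e_blue) = rho.
        e_blue = rho * e_red + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        if e_car > max(e_red, e_blue):
            car_chosen += 1
    return car_chosen / n_draws

share_independent = simulate_car_share(rho=0.0)    # errors independent across modes
share_correlated = simulate_car_share(rho=0.99)    # buses nearly perfect substitutes
```

With independent errors the simulated car share is close to 1/3, as under IIA. When the two bus errors are almost perfectly correlated it stays close to 1/2, which is the behaviour one would expect from commuters who do not care about the colour of the bus.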
What if the IIA hypothesis is not satisfied?
(2) Generalized extreme value: Nested logit models
Very appealing if it is possible to assume sequential choices
The J alternatives are grouped into L subgroups:
(1) First, the group of alternatives is chosen
(2) Then, one alternative is chosen within the group
IIA is maintained within groups, but does not need to hold across groups
Main limitations
. Results can depend on the way in which groups are formed
. There is no specification test to discriminate among different groupings
Categorical variable models Treatment of rankings
Ordered data
$Y$ can assume a limited number of categories $y_c$, $c = 0, 1, \dots, C$
Categories are inherently ordered: $y_0 < y_1 < y_2 < \dots < y_C$
Examples:
. Bond rating: AAA-D
. Symptoms: none, minor, serious
. Drug effect: worsen, none, partial recovery, full recovery
. Customer satisfaction: very unsatisfied, unsatisfied, satisfied, very satisfied
. ...
Ordered probit and logit models
. Multinomial models would fail to account for the ordinal nature of the dependent variable
. OLS would attach a meaning to the difference between the category codings
Latent regression
We consider a continuous latent variable $y^*$ (unobserved), a linear function of $x$ and $\varepsilon$: $y^* = x'\beta + \varepsilon$
We observe $y = c \Leftrightarrow \gamma_c < y^* \leq \gamma_{c+1}$, with $\gamma_0 = -\infty$ and $\gamma_{C+1} = +\infty$
The latent response is specified by a linear regression model without the intercept
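Ordered Probit Model
$y^* = x'\beta + \varepsilon$ with $\varepsilon \sim N(0, 1)$
$\Pr(y_i = 0 \mid x) = \Pr(y_i^* \leq \gamma_1) = \Pr(\varepsilon_i \leq \gamma_1 - x'\beta \mid x) = \Phi(\gamma_1 - x'\beta)$
$\Pr(y_i = 1 \mid x) = \Pr(\gamma_1 < y_i^* \leq \gamma_2) = \Phi(\gamma_2 - x'\beta) - \Phi(\gamma_1 - x'\beta)$
...
$\Pr(y_i = C \mid x) = \Pr(y_i^* > \gamma_C) = 1 - \Phi(\gamma_C - x'\beta)$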
Usually y∗ has no real meaning
The interest is in Pr(y |x) rather than E (y∗|x)
To identify the parameters: x cannot contain the intercept
. If you have to specify a model with an intercept, set γ1 = 0
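The probabilities above translate directly into code. The following is a minimal sketch in plain Python, with hypothetical values for the index $x'\beta$ and the thresholds. It also illustrates the identification issue: adding the same constant to the index (an intercept) and to every threshold leaves all probabilities unchanged.

```python
import math

def norm_cdf(t):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ordered_probit_probs(xb, cutpoints):
    """Return Pr(y = 0), ..., Pr(y = C) given x'beta and gamma_1 < ... < gamma_C."""
    # CDF evaluated at gamma_0 = -inf, gamma_1, ..., gamma_C, gamma_{C+1} = +inf.
    cdf = [0.0] + [norm_cdf(g - xb) for g in cutpoints] + [1.0]
    return [cdf[c + 1] - cdf[c] for c in range(len(cutpoints) + 1)]

# Hypothetical index and thresholds (C = 3, so four categories).
xb = 0.4
cutpoints = [-0.5, 0.3, 1.2]
probs = ordered_probit_probs(xb, cutpoints)

# Shift the index and all thresholds by the same constant.
probs_shifted = ordered_probit_probs(xb + 1.0, [g + 1.0 for g in cutpoints])
```

Because `probs` and `probs_shifted` coincide, an intercept and a full set of thresholds cannot be estimated together, which is why either the intercept is dropped or $\gamma_1$ is fixed at 0.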
Marginal effects
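Coefficients are difficult to interpret:
$$\frac{\partial \Pr(y_i = 0 \mid x)}{\partial x_j} = -\beta_j\,\phi(\gamma_1 - x'\beta)$$
. sign opposite to the sign of $\beta_j$
$$\frac{\partial \Pr(y_i = c \mid x)}{\partial x_j} = \beta_j\,[\phi(\gamma_c - x'\beta) - \phi(\gamma_{c+1} - x'\beta)] \quad (0 < c < C)$$
. ambiguous sign!!!
$$\frac{\partial \Pr(y_i = C \mid x)}{\partial x_j} = \beta_j\,\phi(\gamma_C - x'\beta)$$
. same sign as $\beta_j$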
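The sign pattern can be checked with a short numerical sketch. Differentiating $\Pr(y = c \mid x) = \Phi(\gamma_{c+1} - x'\beta) - \Phi(\gamma_c - x'\beta)$ with respect to $x_j$ gives $\beta_j[\phi(\gamma_c - x'\beta) - \phi(\gamma_{c+1} - x'\beta)]$, with $\phi = 0$ at the infinite end thresholds. The code below uses hypothetical values and compares this expression with finite differences.

```python
import math

def norm_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ordered_probit_probs(xb, cutpoints):
    cdf = [0.0] + [norm_cdf(g - xb) for g in cutpoints] + [1.0]
    return [cdf[c + 1] - cdf[c] for c in range(len(cutpoints) + 1)]

def ordered_probit_marginal_effects(xb, cutpoints, beta_j):
    """dPr(y = c | x)/dx_j for c = 0, ..., C."""
    # Density at gamma_c - x'beta; it is zero at gamma_0 = -inf and gamma_{C+1} = +inf.
    dens = [0.0] + [norm_pdf(g - xb) for g in cutpoints] + [0.0]
    return [beta_j * (dens[c] - dens[c + 1]) for c in range(len(cutpoints) + 1)]

# Hypothetical index, thresholds and coefficient.
xb, cutpoints, beta_j = 0.4, [-0.5, 0.3, 1.2], 0.7
me = ordered_probit_marginal_effects(xb, cutpoints, beta_j)

# Finite differences: raising x_j by h raises the index x'beta by beta_j * h.
h = 1e-6
p0 = ordered_probit_probs(xb, cutpoints)
p1 = ordered_probit_probs(xb + beta_j * h, cutpoints)
fd = [(b - a) / h for a, b in zip(p0, p1)]
```

With $\beta_j > 0$ the effect on the lowest category is negative and the effect on the highest category is positive, while the intermediate categories can go either way. The effects sum to zero because the probabilities sum to one.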
Changes in $y$ and $y^*$ in response to changes in $x$
Increasing one of the $x$'s while holding $\beta$ and $\gamma$ constant is equivalent to shifting the distribution of $y^*$ to the right (solid to dashed curve)
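Ordered Logistic Regression: $\varepsilon_i \sim$ logistic
Proportional odds model
$$\Pr(y_i > c) = \frac{\exp(x_i'\beta - \gamma_{c+1})}{1 + \exp(x_i'\beta - \gamma_{c+1})}$$
$$\log\left(\frac{\Pr(y_i > c)}{1 - \Pr(y_i > c)}\right) = x_i'\beta - \gamma_{c+1}$$
$$\frac{\Pr(y_i > c)/[1 - \Pr(y_i > c)]}{\Pr(y_j > c)/[1 - \Pr(y_j > c)]} = \exp[(x_i - x_j)'\beta]$$
. Doesn't depend on the threshold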
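The proportional odds property is easy to verify numerically. In the sketch below, written in plain Python with hypothetical values, the ratio of the odds of exceeding a category for two individuals is computed at every threshold.

```python
import math

def prob_greater(xb, gamma):
    """Pr(y > c), where gamma is the threshold separating category c from c + 1."""
    return 1.0 / (1.0 + math.exp(-(xb - gamma)))

def odds(p):
    return p / (1.0 - p)

# Hypothetical thresholds and linear indices x_i'beta and x_j'beta.
thresholds = [-0.5, 0.3, 1.2]
xb_i, xb_j = 0.9, 0.2

ratios = [odds(prob_greater(xb_i, g)) / odds(prob_greater(xb_j, g))
          for g in thresholds]
```

Every entry of `ratios` equals $\exp(0.9 - 0.2)$, whichever threshold is used.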
Ordered Probit vs. Ordered Logit
Coefficients and threshold parameters are different due to different scale factors ($\sigma^2_{probit} = 1$, whereas $\sigma^2_{logit} = \pi^2/3$)
Predicted probabilities are similar
Marginal effects are similar
If the logit is chosen, estimated coefficients can be interpreted in terms of odds
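A quick way to see why the two models give similar probabilities despite different coefficients is to compare the two distribution functions after matching their variances. The sketch below rescales the argument of the logistic CDF by $\pi/\sqrt{3}$, the standard deviation of the standard logistic distribution; the grid is an arbitrary choice made for illustration.

```python
import math

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def logistic_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))

# Standard deviation of the standard logistic: sqrt(pi^2 / 3).
scale = math.pi / math.sqrt(3.0)

# Largest gap between the two CDFs on a grid from -4 to 4.
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(norm_cdf(t) - logistic_cdf(scale * t)) for t in grid)
```

On this grid the gap stays below 0.03, so logit coefficients are roughly the probit ones multiplied by the scale factor while the fitted probabilities remain close.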