Multiple Choice Models - phdeconomics.sssup.it · -Transportation choice: McFadden (1974), Train (2003)-Occupational choice: Schmidt and Strauss (1975a,b) Laura Magazzini (@univr.it)

Multiple Choice Models

Laura Magazzini

University of Verona

dse.univr.it/magazzini

Laura Magazzini (@univr.it) Multiple Choice Models 1 / 72

Introduction

Qualitative Response Models

The general class of models we shall consider are those for which the dependent variable takes values 0, 1, 2, ...

In a few cases, the values will themselves be meaningful, for example

. Number of patents: y = 0, 1, 2, ... (count data)

In most cases, the value of the dependent variable is merely a coding for some qualitative outcome:

. Labor force participation: we code “yes” as 1 and “no” as 0 (qualitative choices)

. Occupational field: let 0 be clerk, 1 engineer, 2 lawyer and so on (categories)

. Opinions are usually coded over Likert scales where 1 stands for “strongly disagree”, 2 “disagree”, 3 “neutral”, 4 “agree”, 5 “strongly agree” (rankings)


Binary Choice Model: Example 1
Performance in macroeconomic test (Ex. 21.3 in Greene from Spector and Mazzeo, 1980)

Sample of 32 students

. 14 subject to a new method of teaching economics, the Personalized System of Instruction (PSI)

Dependent variable, GRADE = 1 if the student’s grade in an intermediate macroeconomics course was higher than that in the principles course

GRADE     PSI = 0    PSI = 1    total
0         15         6          21
  %       83.33      42.86      65.63
1         3          8          11
  %       16.67      57.14      34.38
total     18         14         32

(percentages are column shares)


Binary Choice Model: Example 2
Labor participation of married women

Sample of 753 married women

Dependent variable = 1 if the woman worked at least one hour in 1975 (INLF75)

          # of kids < 6 years
INLF75    0        1        2        3        total
0         231      72       19       3        325
  %       38.12    61.02    73.08    100.00   43.16
1         375      46       7        0        428
  %       61.88    38.98    26.92    0.00     56.84
total     606      118      26       3        753

(percentages are column shares)


General Framework

Regression analysis is not readily applicable to these situations

We can construct models that link the decision or outcome to a set of factors, at least in the spirit of the regression

Response probability

Pr(event j occurs|x) = Pr(Y = j |x) = F (x, β)

. x: relevant effects

. β: parameters


Choice set

Discrete choice models describe decision makers’ choices among alternatives

. The decision makers can be people, households, firms, ...

. The alternatives might represent competing products, courses of action, ...

The choice set must have the following characteristics:

(1) mutually exclusive alternatives
(2) exhaustive
(3) finite number of alternatives


Binary Choice Models

The dependent variable can take two values, usually coded as 0/1, indicating whether or not a certain event has occurred

. joined the labor force, effective drug, bankrupt company

Regardless of the definition of y, it is traditional to refer to y = 1 as success and y = 0 as failure

3 approaches

1 The regression approach
2 Latent variables
3 Random Utility Models


The Regression Approach

In the linear regression model you have:

$E[y_i \mid X_i] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} = x_i'\beta$

$y_i = x_i'\beta + \varepsilon_i$

where β is a vector of unknown parameters and εi has mean zero and is homoskedastic

In the case of a binary dependent variable:

$E[y \mid X] = 0 \times \Pr(y = 0 \mid X) + 1 \times \Pr(y = 1 \mid X) = \Pr(y = 1 \mid X)$

We model: $\Pr(y = 1 \mid X) = F(x'\beta)$

. Different choices for F lead to different models

. The set of parameters β is linked to the impact of changes in x on the probability of success



The Linear Probability Model (LPM)

$\Pr(y_i = 1 \mid X) = x_i'\beta$

$y_i = x_i'\beta + \varepsilon_i$

Interpretation of coefficients does not change with respect to the linear model, with the specification that $E(y \mid X) = \Pr(y = 1 \mid X)$

If xj is not functionally related to the other explanatory variables, βj measures the effect of the explanatory variable xj on the probability of success:

$\beta_j = \dfrac{\partial \Pr(y = 1 \mid X)}{\partial x_j}$

In the linear model, the xj can be functions of underlying variables (e.g. log, squared...): this simply changes the interpretation of βj

However, the LPM has a number of shortcomings...



LPM: heteroskedasticity

ε is heteroskedastic

. If $y_i = 0$, $\varepsilon_i = -x_i'\beta$ with probability $\Pr(Y_i = 0 \mid x_i) = 1 - x_i'\beta$

. If $y_i = 1$, $\varepsilon_i = 1 - x_i'\beta$ with probability $\Pr(Y_i = 1 \mid x_i) = x_i'\beta$

. Therefore $E(\varepsilon_i \mid x_i) = -x_i'\beta\,(1 - x_i'\beta) + (1 - x_i'\beta)\,x_i'\beta = 0$

. But $\mathrm{Var}(\varepsilon_i \mid x_i) = E(\varepsilon_i^2 \mid x_i) - E(\varepsilon_i \mid x_i)^2 = \dots = x_i'\beta\,(1 - x_i'\beta)$, i.e. $\mathrm{Var}(\varepsilon_i \mid x_i)$ depends on $x_i$

. So, heteroskedasticity is present unless all the slope coefficients are zero

. The problem can be solved by using heteroskedasticity-robust standard errors and t-tests (or by using GLS)



LPM: predicted values as probabilities?

$\Pr(y_i = 1 \mid X) = x_i'\beta$

$y_i = x_i'\beta + \varepsilon_i$

The most serious flaw is that we cannot be sure that the predictions $\hat{y}_i = x_i'\hat{\beta}$ lie in the 0-1 interval

. $\hat{y}_i$ cannot always be interpreted as a probability

. The estimated $\mathrm{Var}(\varepsilon_i \mid x_i)$ can assume negative values

. However, the model often seems to provide good estimates of the partial effects on the response probability near the center of the distribution of the independent variables



LPM: ceteris paribus increases

$\Pr(y_i = 1 \mid X) = x_i'\beta$

$y_i = x_i'\beta + \varepsilon_i$

LPM assumes that probabilities increase linearly with the explanatory variables

Each unit increase in X has the same effect on the probability of Y occurring regardless of the level of X

This implication cannot literally be true as continually increasing one of the X would eventually drive Pr(y = 1|X) outside the unit interval

It’s more realistic to assume a smaller effect at the extreme values of X



Requirements on F: Probit & Logit

$\lim_{x'\beta \to +\infty} F(x'\beta) = 1$

$\lim_{x'\beta \to -\infty} F(x'\beta) = 0$

Probit Model: $F(x'\beta) = \Phi(x'\beta) = \int_{-\infty}^{x'\beta} \phi(\varepsilon)\, d\varepsilon$

where Φ represents the cumulative distribution function of the N(0, 1), and φ its density function

Logit Model: $F(x'\beta) = \dfrac{\exp(x'\beta)}{1 + \exp(x'\beta)}$



Which one to use?

The two distributions tend to give similar results for intermediate values of x′β

The logistic distribution has heavier tails: it tends to give higher probabilities to y = 1 when x′β is extremely small (and lower probabilities to y = 1 when x′β is very large)

It is difficult to justify the choice of one distribution on theoretical grounds

Historically, logit was preferred for computational reasons
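As a rough numerical illustration, the sketch below evaluates the two link functions at the same index value z. It only compares the shapes of the two CDFs; in applications the estimated coefficients differ in scale across the two models, so this is not a comparison of fitted models.

```python
import math

def probit_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z):
    """Standard logistic CDF exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0):
    print(f"z={z:5.1f}  probit={probit_cdf(z):.4f}  logit={logit_cdf(z):.4f}")
```

Both functions equal 0.5 at z = 0 and approach 0 and 1 in the tails, with the logistic CDF approaching its limits more slowly.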


Latent Variable approach

Example: Work/Not work can be seen as the result of an underlying latent variable (unobserved and unobservable) such as the difference between reservation wage and market wage. We only observe whether the individual works or not, not the underlying comparison between reservation and market wages

The latent variable is defined as $y_i^* = x_i'\beta + \varepsilon_i$

We observe:

. $y_i = 1$ if $y_i^* > c$

. $y_i = 0$ if $y_i^* \le c$

Logit and Probit are obtained by different assumptions on ε

If the interest is in y*, β can be interpreted as the change in the latent variable corresponding to a change in X: rarely the case...



Probit: $\varepsilon \sim N(0, \sigma^2)$

$P(y = 1 \mid x) = P(y^* > c \mid x) = P(x'\beta + \varepsilon > c \mid x) = P(\varepsilon > c - x'\beta \mid x)$

First identification problem: let c = 0
Innocent normalization if the model contains a constant term

$P(y = 1 \mid x) = P(\varepsilon > -x'\beta \mid x)$

$= P(\varepsilon < x'\beta \mid x)$ (due to the symmetry of the normal distribution)

$= \Phi\left(\dfrac{x'\beta}{\sigma}\right)$

Second identification problem: let σ = 1
Innocent normalization: there’s no information about σ in the data



Summing up

$y_i^* = x_i'\beta + \varepsilon_i$

$y_i = 1$ if $y_i^* > 0$

$y_i = 0$ if $y_i^* \le 0$

If ε is assumed N(0, 1): Probit model

If ε is assumed standard logistic (with known variance equal to $\pi^2/3$): Logit model
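A minimal simulation sketch of the latent-variable construction for the probit case. The index value and the seed are hypothetical choices; the last lines illustrate why σ has to be normalized: rescaling β and σ together leaves the response probability unchanged.

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_share(index, sigma, n, rng):
    """Share of draws with y* = index + eps > 0, where eps ~ N(0, sigma^2)."""
    hits = sum(1 for _ in range(n) if index + rng.gauss(0.0, sigma) > 0.0)
    return hits / n

rng = random.Random(12345)   # fixed seed so the run is reproducible
index = 0.5                  # hypothetical value of x'beta
share = simulate_share(index, 1.0, 20000, rng)
print("simulated Pr(y = 1):", round(share, 3))
print("Phi(x'beta):        ", round(normal_cdf(index), 3))

# Scale is not identified: (beta, sigma) and (2 beta, 2 sigma) give the same probability.
p_a = normal_cdf(index / 1.0)
p_b = normal_cdf((2.0 * index) / 2.0)
print("same probability under rescaling:", p_a == p_b)
```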


Random Utility Models
Main references

Discrete choice models describe the choice of agents among a set of alternatives

. For the moment, we consider choice among 2 alternatives

Any agent can be considered:

. individual (consumer, commuter, household, ...)

. firm

. other...

References:

. Thurstone (1927), Psychological Review

. First economic application by Luce (1959), Individual Choice Behavior

. Some applications:

- Transportation choice: McFadden (1974), Train (2003)
- Occupational choice: Schmidt and Strauss (1975a,b)


Random Utility Models
Structural Approach

If we are willing to assume rational behaviour: the agent chooses the alternative that maximizes his/her utility

. The utility is known to the agent, but not to the econometrician

. You can think of utility as a latent variable

Let us consider the choice among two alternatives (exhaustive and mutually exclusive)

C1: $U_{i1} = x_{i1}'\beta_1 + \varepsilon_{i1}$
C2: $U_{i2} = x_{i2}'\beta_2 + \varepsilon_{i2}$

We observe $y_i = 1$ if $U_{i1} > U_{i2}$, otherwise $y_i = 0$

Different hypotheses on εi lead to different models:

$\Pr(y_i = 1) = \Pr(U_{i1} > U_{i2})$

$= \Pr(x_{i1}'\beta_1 + \varepsilon_{i1} > x_{i2}'\beta_2 + \varepsilon_{i2})$

$= \Pr(\varepsilon_{i1} - \varepsilon_{i2} > x_{i2}'\beta_2 - x_{i1}'\beta_1)$

If εi1, εi2 ∼ N: Probit model
If εi1, εi2 ∼ EV1 (extreme value of first type): Logit model


Estimation and Inference in Binary Choice Models

Likelihood function

Estimation of binary choice models is usually based on the method of maximum likelihood (with the exception of the LPM)

y1, ..., yN: sample of size N from a population with density f(y|θ)

. θ: vector of parameters that fully characterizes the distribution of y in the population

. Inference: we use the observed sample to get information about the value of θ in the population

The joint density of N independent and identically distributed observations from the data generating process f(y|θ) is the likelihood function:

$f(y_1, \dots, y_N \mid \theta) = \prod_{i=1}^{N} f(y_i \mid \theta) = L(\theta \mid y)$

. Note that we write L(θ|y): the interest is in the parameters and the information about them that is contained in the observed data



Identification

The issue of identification must be resolved before estimation can even be considered

Suppose we had an infinitely large sample: could we uniquely determine the value of θ from such a sample?

The parameter vector θ is identified (estimable) if for any other parameter vector θ* ≠ θ, for any data y1, ..., yN, L(θ*|y) ≠ L(θ|y)

Example: recall the case of probit where we set σ = 1



How to employ the likelihood function to estimate θ?

Principle of maximum likelihood: an event occurred because it was most likely to

Example: 10 independent draws from a Bernoulli distribution (0, 1, 1, 0, 0, 1, 0, 0, 0, 1)

. The parameter θ is the (population) probability of success

The density for each observation is $f(y_i \mid \theta) = \theta^{y_i}(1 - \theta)^{1 - y_i}$

The joint density, i.e. the likelihood for this sample is

$L(\theta \mid y) = \prod_{i=1}^{N} \left[\theta^{y_i}(1 - \theta)^{1 - y_i}\right] = \theta^4 (1 - \theta)^6$

L(θ|y) is the “probability” of observing this particular sample, assuming that a Bernoulli distribution with as yet unknown parameter θ generated the data

. The reference to the “probability” of observing the given sample is not exact in a continuous distribution, since a particular sample has probability zero (nonetheless, the principle is the same)



10 Bernoulli draws: 0, 1, 1, 0, 0, 1, 0, 0, 0, 1

The function has a single mode at θ = 0.4, which would be the maximum likelihood estimate (MLE) of θ

Since the log function is monotonically increasing and easier to work with, ln L(θ|y) is usually maximized instead
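The Bernoulli example can be checked with a short sketch that evaluates the log-likelihood on a grid. The grid search is only an illustrative device (the maximizer is also available in closed form as the sample mean).

```python
import math

draws = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1]   # the sample from the slide
successes = sum(draws)
failures = len(draws) - successes

def log_lik(theta):
    """ln L(theta | y) = (#successes) ln(theta) + (#failures) ln(1 - theta)."""
    return successes * math.log(theta) + failures * math.log(1.0 - theta)

grid = [i / 100 for i in range(1, 100)]          # 0.01, 0.02, ..., 0.99
theta_hat = max(grid, key=log_lik)               # grid maximizer
print("grid maximizer:", theta_hat)
print("sample mean:   ", successes / len(draws))
```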



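Properties of the MLE $\hat{\theta}$
Assuming that regularity conditions hold

Consistency: $\operatorname{plim} \hat{\theta} = \theta_0$

Asymptotic normality: $\hat{\theta} \xrightarrow{d} N(\theta_0, I(\theta_0)^{-1})$, where $I(\theta_0) = -E_0\left[\dfrac{\partial^2 \ln L}{\partial \theta_0\, \partial \theta_0'}\right]$

Asymptotic efficiency: $\hat{\theta}$ achieves the Cramér-Rao lower bound

Invariance: The MLE of $\gamma_0 = c(\theta_0)$ is $c(\hat{\theta})$ if $c(\theta_0)$ is a continuous and continuously differentiable function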



Probit & Logit: Likelihood Function

Estimation of binary choice models is usually based on the method of maximum likelihood (with the exception of the linear probability model)

Each observation is treated as a single draw from a Bernoulli distribution with probability of success $\Pr(y_i = 1) = F(x_i'\beta) = F_i$

In case of independent observations, the likelihood function can be written as:

$L(\beta \mid y) = \Pr(Y_1 = y_1, \dots, Y_n = y_n) = \prod_{y_i = 1} F_i \times \prod_{y_i = 0} (1 - F_i) = \prod_{i=1}^{n} F_i^{y_i} (1 - F_i)^{1 - y_i}$

Taking logs: $\ln L(\beta \mid y) = \sum_{i=1}^{n} \left[ y_i \ln F_i + (1 - y_i) \ln(1 - F_i) \right]$



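Maximum Likelihood Estimation

$\ln L(\beta \mid y) = \sum_{i=1}^{n} \left[ y_i \ln F_i + (1 - y_i) \ln(1 - F_i) \right]$

First order conditions: system of k equations in k unknowns

$\dfrac{\partial \ln L(\beta \mid y)}{\partial \beta} = \sum_{i=1}^{n} \left( y_i \dfrac{f_i}{F_i} + (1 - y_i) \dfrac{-f_i}{1 - F_i} \right) x_i = 0$

where $f_i = \dfrac{\partial F_i}{\partial (x_i'\beta)}$

Nonlinear likelihood equations (except in the case of the linear probability model): iterative solutions needed

From general ML theory: $\sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d} N(0, A^{-1})$ where $A = -\lim_{n \to \infty} \dfrac{1}{n} E\left(\dfrac{\partial^2 \ln L}{\partial \beta\, \partial \beta'}\right)$

. In most cases the inverse exists, and when it does, A is positive definite

. If the matrix is not invertible, then perfect collinearity probably exists among the regressors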



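Maximum Likelihood Estimation (1)
Probit and Logit

In the case of probit and logit, the log-likelihood function is globally concave (the Hessian is always negative definite)

As a consequence, a unique solution exists to the maximization problem

In the case of logit, the FOC can be written as:

$\dfrac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} (y_i - F_i)\, x_i = 0$

. Orthogonality condition between generalized residuals and independent variables

. If the model contains the constant term: $\sum_{i=1}^{n} (y_i - F_i) = 0 \Leftrightarrow \sum_{i=1}^{n} \left( \dfrac{y_i}{n} - \dfrac{F_i}{n} \right) = 0 \Leftrightarrow \sum_{i=1}^{n} \dfrac{y_i}{n} = \sum_{i=1}^{n} \dfrac{F_i}{n}$, i.e. the average predicted probability equals the sample share of successes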
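The logit first order conditions can be solved iteratively. The sketch below runs Newton-Raphson on the GRADE-by-PSI cross-tabulation from Example 1, using a constant and the PSI dummy as the only regressors. This is an illustrative simplification (not the full specification of the textbook example), and the number of iterations and the zero starting values are arbitrary choices.

```python
import math

# Cell counts from the GRADE-by-PSI table: (PSI value, number of students, number with GRADE = 1).
cells = [(0, 18, 3), (1, 14, 8)]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 0.0, 0.0                      # starting values
for _ in range(25):                    # Newton-Raphson iterations
    g0 = g1 = 0.0                      # score vector
    h00 = h01 = h11 = 0.0              # minus the Hessian, i.e. sum of F(1-F) x x'
    for psi, n, ones in cells:
        F = logistic(b0 + b1 * psi)
        resid = ones - n * F           # sum of (y_i - F_i) within the cell
        w = n * F * (1.0 - F)
        g0 += resid
        g1 += resid * psi
        h00 += w
        h01 += w * psi
        h11 += w * psi * psi
    det = h00 * h11 - h01 * h01
    # Newton step: beta <- beta + (X'WX)^{-1} * score, with the 2x2 inverse written out.
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (-h01 * g0 + h00 * g1) / det

n_total = sum(n for _, n, _ in cells)
mean_y = sum(ones for _, _, ones in cells) / n_total
mean_F = sum(n * logistic(b0 + b1 * psi) for psi, n, _ in cells) / n_total
print("intercept, slope:", round(b0, 4), round(b1, 4))
print("mean of y, mean of fitted F:", round(mean_y, 4), round(mean_F, 4))
```

Because the model contains a constant, the average fitted probability coincides with the sample share of successes at the solution.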



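Maximum Likelihood Estimation (2)
Probit and Logit

In the case of probit:

$\dfrac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} \lambda_i x_i = 0$

where

$\lambda_i = \begin{cases} \dfrac{\phi(x_i'\beta)}{\Phi(x_i'\beta)} & \text{if } y_i = 1 \\[2ex] \dfrac{-\phi(x_i'\beta)}{1 - \Phi(x_i'\beta)} & \text{if } y_i = 0 \end{cases}$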



Robust estimation?
Probit model

A number of applications based on the probit model make use of White's robust “sandwich” estimator

If the model is correctly specified (i.e. homoskedasticity holds), applying the “sandwich” estimator does not change (asymptotically) the estimate of the variance-covariance matrix

The presence of heteroskedasticity entirely changes the functional form for Pr(y = 1|X): if $\sigma_i^2 \ne \sigma^2$ then $\Pr(y = 1 \mid X) \ne \Phi(x'\beta)$

As a result probit will be inconsistent for β when the error term is heteroskedastic!!!



Marginal Effects

Recall the linear regression model: $y_i = x_i'\beta + \varepsilon_i$

. βj represents the change in yi due to a unit change in xij, ceteris paribus: $\beta_j = \dfrac{\partial E(y_i \mid x_i)}{\partial x_{ij}}$

In the logit and probit models: $E(y_i \mid x_i) = \Pr(y_i = 1 \mid x_i)$ represents the probability of success

$\dfrac{\partial E(y_i \mid x_i)}{\partial x_{ij}}$ is called marginal effect and represents the change in the probability of success due to a unit change in the independent variable xj

$\dfrac{\partial E(y_i \mid x_i)}{\partial x_{ij}} = \dfrac{\partial \Pr(y_i = 1 \mid x_i)}{\partial x_{ij}} = f(x_i'\beta)\,\beta_j$

where $f(z) = \dfrac{\partial F}{\partial z}$



Marginal Effects: Probit & Logit

Probit: $\dfrac{\partial E(y_i \mid x_i)}{\partial x_{ij}} = \phi(x_i'\beta)\,\beta_j$

Logit: $\dfrac{\partial E(y_i \mid x_i)}{\partial x_{ij}} = F_i (1 - F_i)\,\beta_j$

The β coefficients do not have a direct interpretation

. Knowing the sign of βj is enough to determine whether the variable xj has a positive or negative effect

. But to find the magnitude of the effect, we have to take into account the value of all the explanatory variables

Marginal effect of the “average” person vs. average of individual marginal effects

. If you have dummy variable(s) in your regression, then its average does not correspond to any individual: it makes more sense to consider 0 or 1

. To compute the marginal effect for dummy variables, take the difference Pr1(x) − Pr0(x), i.e. the probability of success with the dummy set to 1 minus the probability with the dummy set to 0
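The difference between the marginal effect of the “average” person and the average of the individual marginal effects can be illustrated with a small logit sketch. All coefficients and data values below are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit_me(z, beta_j):
    """Logit marginal effect F(z) (1 - F(z)) beta_j at index value z."""
    F = logistic(z)
    return F * (1.0 - F) * beta_j

# Hypothetical coefficients and data (illustration only).
b0, b1 = -2.0, 0.8
x = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

x_mean = sum(x) / len(x)
me_at_mean = logit_me(b0 + b1 * x_mean, b1)                       # "average" person
avg_me = sum(logit_me(b0 + b1 * xi, b1) for xi in x) / len(x)     # average of individual effects

print("marginal effect at the mean of x:", round(me_at_mean, 4))
print("average marginal effect:        ", round(avg_me, 4))

# Dummy regressor d with hypothetical coefficient b_d: use a discrete difference.
b_d = 1.2
effect_d = logistic(b0 + b1 * x_mean + b_d) - logistic(b0 + b1 * x_mean)
print("effect of switching d from 0 to 1:", round(effect_d, 4))
```

The two summary measures generally differ because the density f(x′β) is nonlinear in x.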



Marginal Effects: Standard errors

The Delta-Method can be employed to compute the standard error of the marginal effect:

Let $\hat{\theta}$ be an estimate of the parameter of interest

. For convenience let θ be a scalar

IF $\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \sigma^2)$

THEN $\sqrt{n}(g(\hat{\theta}) - g(\theta_0)) \xrightarrow{d} N(0, g'(\theta_0)^2 \sigma^2)$

. In the multiparametric case, the Jacobian of g with respect to θ is employed

In our case:

Let Σ be the (asymptotic) variance-covariance matrix of the coefficients

The (asymptotic) variance-covariance matrix of the marginal effects can be computed as

$\dfrac{\partial ME}{\partial \beta'}\; \Sigma \;\left(\dfrac{\partial ME}{\partial \beta'}\right)'$
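A scalar sketch of the delta method, using the PSI = 0 column of the table in Example 1. The setup is an illustrative assumption: an intercept-only logit on that group, whose information is n·p(1 − p), with g mapping the log-odds back to a probability. In this special case the delta-method variance coincides with the usual variance of a sample proportion.

```python
import math

# PSI = 0 column of the GRADE-by-PSI table: 18 students, 3 with GRADE = 1.
n, ones = 18, 3
p_hat = ones / n

# Intercept-only logit on this group: theta_hat is the log-odds, and the
# information is n * p * (1 - p), so Var(theta_hat) is its inverse.
theta_hat = math.log(p_hat / (1.0 - p_hat))
var_theta = 1.0 / (n * p_hat * (1.0 - p_hat))

# g(theta) = exp(theta) / (1 + exp(theta)) maps the log-odds back to a probability.
g = 1.0 / (1.0 + math.exp(-theta_hat))
g_prime = g * (1.0 - g)                      # derivative of the logistic function

var_g_delta = g_prime ** 2 * var_theta       # delta-method variance of g(theta_hat)
var_g_direct = p_hat * (1.0 - p_hat) / n     # usual variance of a sample proportion

print("delta-method s.e.:", round(math.sqrt(var_g_delta), 4))
print("direct s.e.:      ", round(math.sqrt(var_g_direct), 4))
```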



Testing hypothesis about your coefficients

We consider the MLE of a parameter θ and a test of the hypothesis c(θ) = q

. Likelihood ratio test: compares LU (value of the likelihood function at the unconstrained estimate of θ) and LR (value of the likelihood function at the restricted estimate). If the restriction is valid, then imposing it should not lead to a large reduction in the log-likelihood function

. Wald test: If the restriction is valid, then $c(\hat{\theta}_{MLE})$ should be close to q

. Lagrange Multiplier test: The test is based on the slope of the log-likelihood at the point where the function is maximized subject to the restriction. If the restriction is valid, the slope of the log-likelihood function should be near zero at the restricted estimator

The three tests are asymptotically equivalent under the null hypothesis, but can behave rather differently in a small sample

When applied to linear models, W ≥ LR ≥ LM



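Testing hypothesis about your coefficients
H0: c(θ) = q, with q of size k × 1

Likelihood ratio test

$LR = -2(\ln L_R - \ln L_U) = -2 \ln \dfrac{L_R}{L_U} \sim \chi^2_k$

Wald test

$W = [c(\hat{\theta}) - q]' \left(\text{Asy.Var}[c(\hat{\theta}) - q]\right)^{-1} [c(\hat{\theta}) - q] \sim \chi^2_k$

Lagrange Multiplier test (or score test)

$LM = \left(\dfrac{\partial \ln L(\hat{\theta}_R)}{\partial \hat{\theta}_R}\right)' [I(\hat{\theta}_R)]^{-1} \left(\dfrac{\partial \ln L(\hat{\theta}_R)}{\partial \hat{\theta}_R}\right) \sim \chi^2_k$

Three asymptotically equivalent test procedures
H0: c(θ) = 0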


Goodness of fit

How does your model fit the data?

Compare your model with the null model (constant term only)

H0: β = 0 (all slope coefficients equal to zero) can be tested with $-2[\ln L_0 - \ln L] \xrightarrow{d} \chi^2_{k-1}$

pseudo-$R^2 = 1 - \dfrac{\ln L}{\ln L_0}$

Various other measures have been proposed in the literature, but no standard procedure has been agreed upon yet

In order to compute L0 no maximization is needed: the MLE is obtained from the share of successes in the sample



A note on pseudo-$R^2 = 1 - \dfrac{\ln L}{\ln L_0}$

pseudo-R² = 0 when ln L / ln L0 = 1

. poor fit of the model

In order to have pseudo-R² = 1: ln L = 0 ⇔ L = 1

. This can only happen when Fi = 1 if yi = 1 and Fi = 0 if yi = 0

. BUT 0 < Fi < 1 (equal to 0 or 1 only at the limit)

. Therefore pseudo-R² < 1
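As a worked illustration, the likelihood ratio statistic and the pseudo-R² can be computed from the GRADE-by-PSI table of Example 1. The sketch assumes a model with a constant and the PSI dummy only; for such a model the fitted probabilities equal the cell shares, so no iterative maximization is needed here.

```python
import math

# Cell counts from the GRADE-by-PSI table: (number of students, number with GRADE = 1).
cells = {"PSI=0": (18, 3), "PSI=1": (14, 8)}

def bernoulli_loglik(n, ones, p):
    """Log-likelihood of `ones` successes and n - ones failures at success probability p."""
    return ones * math.log(p) + (n - ones) * math.log(1.0 - p)

n_total = sum(n for n, _ in cells.values())
ones_total = sum(ones for _, ones in cells.values())

# Null model (constant only): the MLE of Pr(y = 1) is the overall share of successes.
lnL0 = bernoulli_loglik(n_total, ones_total, ones_total / n_total)

# Model with a constant and the PSI dummy: fitted probabilities are the cell shares.
lnL = sum(bernoulli_loglik(n, ones, ones / n) for n, ones in cells.values())

LR = -2.0 * (lnL0 - lnL)            # compare with a chi-square with 1 degree of freedom
pseudo_r2 = 1.0 - lnL / lnL0

print("ln L0 =", round(lnL0, 4), " ln L =", round(lnL, 4))
print("LR =", round(LR, 4), " pseudo-R2 =", round(pseudo_r2, 4))
```

The statistic is compared with a χ² with one degree of freedom, since the model has k = 2 parameters and the null restricts the single slope.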



Share of correctly predicted

A useful summary of the predictive ability of the model is a 2 × 2 table of the hits and misses of a prediction rule such as

$\hat{y} = 1$ if $\hat{F} > 0.5$ and 0 otherwise

Do not place too much emphasis on this measure of goodness of fit

. Let P be the observed share of successes (share of 1s in your sample). The naive predictor

$\hat{y} = 1$ if P > 0.5 and 0 otherwise

would always predict correctly 100P percent of the observations when P > 0.5 (and 100(1 − P) percent otherwise)

. The naive model does not have a zero fit

. Compute the percent correctly predicted for each outcome, y = 0 and y = 1
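The hit-and-miss table can be sketched with the GRADE-by-PSI counts of Example 1. As an illustrative assumption, the fitted probability for each student is taken to be the share of GRADE = 1 in the student's PSI group (what a model with a constant and the PSI dummy would give).

```python
# Counts from the GRADE-by-PSI table: counts[psi][grade].
counts = {0: {0: 15, 1: 3}, 1: {0: 6, 1: 8}}

n_total = sum(sum(row.values()) for row in counts.values())
hits = {0: 0, 1: 0}      # correct predictions by observed outcome
totals = {0: 0, 1: 0}    # observations by observed outcome

for psi, row in counts.items():
    share_ones = row[1] / (row[0] + row[1])      # fitted probability in this cell
    predicted = 1 if share_ones > 0.5 else 0     # prediction rule with 0.5 threshold
    for grade, count in row.items():
        totals[grade] += count
        if grade == predicted:
            hits[grade] += count

P = totals[1] / n_total                          # observed share of successes
naive_share = max(P, 1.0 - P)                    # always predict the more frequent outcome
model_share = (hits[0] + hits[1]) / n_total

print("model, share correct:", round(model_share, 3))
print("naive, share correct:", round(naive_share, 3))
print("correct among y = 0:", hits[0], "of", totals[0])
print("correct among y = 1:", hits[1], "of", totals[1])
```

Reporting the hits separately for y = 0 and y = 1 shows where the improvement over the naive predictor comes from.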



Example: Labor participation of married women
Example 15.2 from Wooldridge



Specification issues in probit models

Neglected heterogeneity

Omitted variables

. These problems are relevant for all index models

. Since the normal distribution allows us to obtain concrete results, the focus is on probit

In linear models:

. Heterogeneity causes OLS to be inefficient, however it is still consistent

. Omitted variables can lead to inconsistent estimates, unless...

- the omitted variable does not affect y
- the omitted variable is uncorrelated with x



Neglected heterogeneity (1)

The (structural) model of interest is: $\Pr(y = 1 \mid X) = \Phi(x'\beta + \gamma c)$

. In a latent variable form: $y^* = x'\beta + \gamma c + \varepsilon$, where $y = 1[y^* > 0]$

. The interest is in the partial effect of the xj on the probability of success, holding c and the other elements of X constant

Assume c independent of X and $c \sim N(0, \tau^2)$

Then, γc + ε is independent of X and has a normal distribution with mean zero and variance $\sigma^2 = \gamma^2\tau^2 + 1$

$\Pr(y = 1 \mid X) = \Pr(\gamma c + \varepsilon > -x'\beta \mid X) = \Phi(x'\beta/\sigma)$

Probit of y on X consistently estimates β/σ

As $\sigma = (\gamma^2\tau^2 + 1)^{1/2} > 1$, estimates suffer from attenuation bias: $|\beta_j/\sigma| < |\beta_j|$



Neglected heterogeneity (2)

However, interest is in

$\partial \Pr(y = 1 \mid X)/\partial x_j = \beta_j \phi(x'\beta + \gamma c)$

What we can consistently estimate is $(\beta_j/\sigma)\,\phi(x'\beta/\sigma)$

. Different from $\partial \Pr(y = 1 \mid X)/\partial x_j$ evaluated at c = 0 (not really informative...)

. Equal to the average partial effect: $E[\beta_j \phi(x'\beta + \gamma c)] = \dfrac{\beta_j}{\sigma}\,\phi\left(\dfrac{x'\beta}{\sigma}\right)$

Omitted heterogeneity is not a problem in probit when it is independent of X (average partial effects can be consistently estimated)

However, if c is correlated with X (or otherwise dependent on X), then the omission of c is serious: we cannot get consistent estimates of the APE
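The average partial effect identity can be checked numerically. The sketch below integrates β_j φ(x′β + γc) over the N(0, τ²) distribution of c with the trapezoid rule and compares the result with (β_j/σ) φ(x′β/σ). All parameter values are hypothetical, and the grid width and step count are arbitrary choices.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Hypothetical values (illustration only).
beta_j, index = 0.7, 0.4        # coefficient and value of x'beta
gamma, tau = 1.5, 0.8           # loading and standard deviation of c

sigma = math.sqrt(gamma ** 2 * tau ** 2 + 1.0)

# Left-hand side: E_c[beta_j * phi(x'beta + gamma c)] with c ~ N(0, tau^2),
# computed by the trapezoid rule on a wide grid.
lo, hi, steps = -10.0 * tau, 10.0 * tau, 20000
h = (hi - lo) / steps
total = 0.0
for k in range(steps + 1):
    c = lo + k * h
    weight = 0.5 if k in (0, steps) else 1.0
    density_c = phi(c / tau) / tau
    total += weight * beta_j * phi(index + gamma * c) * density_c
ape_numeric = total * h

# Right-hand side: the quantity that probit of y on x estimates.
ape_formula = (beta_j / sigma) * phi(index / sigma)

print("numerical average partial effect:", round(ape_numeric, 6))
print("(beta_j / sigma) phi(x'beta / sigma):", round(ape_formula, 6))
```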



Continuous endogenous explanatory variable

One possibility: 2SLS applied to LPM

To apply probit estimation, stronger assumptions are needed

In the following model assume normally distributed error terms:

$y_1^* = z_1\delta_1 + \alpha_1 y_2 + u_1$, where $y_1 = 1[y_1^* > 0]$

$y_2 = z_1\delta_{21} + z_2\delta_{22} + u_2 = z\delta_2 + u_2$

If u1 and u2 are independent, no endogeneity problem exists

If instead y2 is correlated with u1, the APE can still be estimated

. Normalization of the variance of u1 is needed ($\sigma^2_{u_1} = 1$)

Under the assumption of joint normality, full maximum likelihood can be employed for estimation



Continuous endogenous explanatory variable
Two-step approach (Rivers and Vuong, 1988)

As (u1, u2) are jointly normal and correlation exists between the two terms, we can write:

$u_1 = \theta_1 u_2 + e_1 = \dfrac{\mathrm{cov}(u_1, u_2)}{\mathrm{var}(u_2)}\, u_2 + e_1$

Because of joint normality of (u1, u2), e1 is also normal with mean zero and variance $\mathrm{Var}(u_1) - \theta_1^2 \mathrm{Var}(u_2) = 1 - \theta_1^2 \mathrm{Var}(u_2)$

If u2 were observable, APE could be estimated from the model:

$y_1^* = z_1\delta_1 + \alpha_1 y_2 + u_1 = z_1\delta_1 + \alpha_1 y_2 + \theta_1 u_2 + e_1$

Two-step approach

(a) Run OLS of y2 on z: you get a consistent estimate of the residuals, $\hat{u}_2$
(b) Run probit of y1 on z1, y2, and $\hat{u}_2$

Test of exogeneity: H0: θ1 = 0



Binary endogenous variable

$y_1 = 1[z_1\delta_1 + \alpha_1 y_2 + u_1 > 0]$

$y_2 = 1[z\delta_2 + u_2 > 0]$

The average treatment effect is $\Phi(z_1\delta_1 + \alpha_1) - \Phi(z_1\delta_1)$

Under the normality assumption, the joint distribution of (y1, y2) is specified: maximum likelihood can be applied for estimation

The Rivers-Vuong approach can be employed to test for endogeneity

Alternatively a score test for H0: cov(u1, u2) = 0 can be applied (it does not require estimation of the full model)


Categorical variable models

Categorical data

Y is the result of a single decision among more than 2 alternatives

Unordered choice set: Categories/Qualitative choices

. multinomial logit, conditional logit, nested logit

Ordered choice set (rankings): models for ordered data

. ordered probit


Models for categories/qualitative choices

Example: Education and Occupational Choice

               Education
Occupation     Primary/Secondary School   University or more   Total
Menial         23 (74.19%)                8 (25.81%)           31 (100%)
Blue Collar    60 (86.96%)                9 (13.04%)           69 (100%)
Craft          65 (77.38%)                19 (22.62%)          84 (100%)
WhiteCol       27 (65.85%)                14 (34.15%)          41 (100%)
Prof           27 (24.11%)                85 (75.89%)          112 (100%)
Total          202 (59.94%)               135 (40.06%)         337 (100%)



Multinomial distribution

Yi : qualitative random variable with J categories

Pij = Pr(Yi = j), j = 1, 2, ..., J

. Probability that individual i will choose alternative j

Categories are mutually exclusive and exhaustive: $\sum_j P_{ij} = 1$, i = 1, 2, ..., N

Let $d_i = (d_{i1}, d_{i2}, \dots, d_{iJ})$, where $d_{ij} = 1$ if $Y_i = j$ (and 0 otherwise)

. $\sum_j d_{ij} = 1$, i = 1, 2, ..., N



Multinomial logit model (MNL)

Y : result of a choice among J alternatives (J > 2)

di = (di1, di2, ..., diJ), where dij = 1 if Yi = j

$P_{ij} = \Pr(Y_i = j)$, $\sum_j P_{ij} = 1$

Logit model:

$\Pr(Y_i = j) = \dfrac{\exp(\eta_{ij})}{\sum_{l=1}^{J} \exp(\eta_{il})}$



Properties of MNL

$0 \le P_{ij} \le 1$

$\sum_j P_{ij} = 1$ (by definition)

For every pair of alternatives (k, l), the probability ratio is

$\dfrac{P_{ik}}{P_{il}} = \dfrac{\exp(\eta_{ik})}{\exp(\eta_{il})} \Rightarrow \log \dfrac{P_{ik}}{P_{il}} = \eta_{ik} - \eta_{il}$

The model can be motivated by a random utility model



Random Utility Models (1)
McFadden (1973, 2001)

J alternatives: mutually exclusive, exhaustive, finite set

. Examples: competing brands, different means of transport, different occupations, ...

Categories can be ordered or unordered

. Different techniques will be employed according to the nature of the alternatives

. Assume non-ordered alternatives

Rational agent chooses the alternative that maximizes his/her utility: $Y_i = j$ if $U_{ij} > U_{ik}$ for each $k \ne j$



Random Utility Models (2)
McFadden (1973, 2001)

Linear utility model: $U_{ij} = \eta_{ij} + \varepsilon_{ij}$ with $\eta_{ij} = LC(z_{ij}, \theta)$

. ηij links the agent utility to factors that can be observed

. ηij is different from Uij since there are factors that cannot be observed by the researcher

$\Pr(Y_i = j) = \Pr(U_{ij} > U_{ik}, \forall k \ne j)$

$= \Pr(\eta_{ij} + \varepsilon_{ij} > \eta_{ik} + \varepsilon_{ik}, \forall k \ne j)$

$= \Pr(\varepsilon_{ik} - \varepsilon_{ij} < \eta_{ij} - \eta_{ik}, \forall k \ne j)$

$= \int_{\varepsilon} I(\varepsilon_{ik} - \varepsilon_{ij} < \eta_{ij} - \eta_{ik}, \forall k \ne j)\, f(\varepsilon)\, d\varepsilon$

with f the probability density function of ε

The model is made operational by a particular choice of distribution for the disturbance

. Closed functional forms exist only for a few specifications (e.g. logit)



How to specify ηij?

‘Standard’ MNL

. $\eta_{ij} = x_i'\beta_j$

. x: individual characteristics, constant across all the alternatives j

Conditional logit model

. $\eta_{ij} = z_{ij}'\gamma$

. zij: characteristics of the choice j and individual i

- Datasets typically analyzed by economists do not contain mixtures of individual and choice-specific attributes

- CLM is usually applied when the interest is in the effect of choice-specific attributes

- Custom transformation is needed for variables containing individual-specific attributes



‘Standard’ MNL

$\Pr(Y_i = j \mid x_i) = \dfrac{\exp(x_i'\beta_j)}{\sum_{l=1}^{J} \exp(x_i'\beta_l)}$

It is not possible to estimate all the β1, ..., βJ

By adding the same constant vector to all the βs, the probabilities do not change

Indeterminacy in the model is removed by letting β1 = 0

j = 1 is the reference category

$\Pr(Y_i = j \mid x_i) = \dfrac{\exp(x_i'\beta_j)}{1 + \sum_{l=2}^{J} \exp(x_i'\beta_l)}$

An intercept in the model is allowed by letting the first column of xi = 1 for every i
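The indeterminacy and its normalization can be illustrated with a small sketch. The function name `mnl_probs` and all numerical values are hypothetical; the code only evaluates the MNL probabilities and checks that a common shift of all the β vectors leaves them unchanged.

```python
import math

def mnl_probs(x, betas):
    """Multinomial logit probabilities exp(x'b_j) / sum_l exp(x'b_l)."""
    etas = [sum(xk * bk for xk, bk in zip(x, b)) for b in betas]
    m = max(etas)                                   # subtract the max for numerical stability
    expo = [math.exp(e - m) for e in etas]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical example: J = 3 alternatives, x = (1, x1) with the constant first.
x = [1.0, 2.0]
betas = [[0.0, 0.0],      # beta_1 = 0: reference category
         [0.5, -0.3],
         [-1.0, 0.4]]

p = mnl_probs(x, betas)

# Adding the same vector to every beta_j leaves the probabilities unchanged.
shift = [0.7, -1.1]
shifted = [[bk + sk for bk, sk in zip(b, shift)] for b in betas]
p_shifted = mnl_probs(x, shifted)

print("probabilities:       ", [round(v, 4) for v in p])
print("after common shift:  ", [round(v, 4) for v in p_shifted])
print("log-odds of 2 vs 1:  ", round(math.log(p[1] / p[0]), 4))   # equals x'beta_2
```

With β1 = 0, the log-odds of alternative j against the reference category reduce to x′βj.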



Estimation: MLE

The log likelihood can be written as

$\ln L = \sum_{i=1}^{n} \sum_{j=1}^{J} d_{ij} \ln \Pr(Y_i = j)$

. with dij = 1 if Yi = j, 0 otherwise

The derivatives have the characteristically simple form:

$\dfrac{\partial \ln L}{\partial \beta_j} = \sum_i (d_{ij} - P_{ij})\, x_i = 0$

As a consequence, if the model is estimated with an intercept, $\sum_i d_{ij} = \sum_i P_{ij}$ for each j: the predicted frequencies match the observed ones



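Interpretation of the parameters

The partial effects for this model are complicated:

$\dfrac{\partial P_j}{\partial x_i} = P_j \left[\beta_j - \sum_{k=1}^{J} P_k \beta_k\right] = P_j [\beta_j - \bar{\beta}]$

. The coefficients in this model are difficult to interpret: ∂Pj/∂xk need not have the same sign as βjk

A simpler interpretation is obtained by considering the log-odds:

. $\ln \dfrac{P_{ij}}{P_{i1}} = x_i'\beta_j$

. $\ln \dfrac{P_{ij}}{P_{ik}} = x_i'(\beta_j - \beta_k)$ if k ≠ 1

In case of dummy variables (coded as 0 or 1), switching the dummy from 0 to 1 (other regressors held fixed) changes the log-odds by the corresponding coefficient:

. $\ln \dfrac{P_{ij}(x_i = 1)}{P_{i1}(x_i = 1)} - \ln \dfrac{P_{ij}(x_i = 0)}{P_{i1}(x_i = 0)} = \beta_j$

. $\ln \dfrac{P_{ij}(x_i = 1)}{P_{ik}(x_i = 1)} - \ln \dfrac{P_{ij}(x_i = 0)}{P_{ik}(x_i = 0)} = \beta_j - \beta_k$ if k ≠ 1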



Conditional logit model

$\Pr(Y_i = j \mid z_j) = \dfrac{\exp(z_j\beta)}{\sum_{k=1}^{J} \exp(z_k\beta)}$

The model contains choice-specific attributes

The coefficients of individual-specific attributes (that do not vary across categories) are not identified

. Individual-specific variables can be inserted in the model, but need to be properly transformed

All the coefficients of the choice-specific attributes cannot be separately identified: adding a constant to all the coefficients does not change the estimated probability

. The intercept is set to zero



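Marginal effects

$\dfrac{\partial P_j(z)}{\partial z_k} = \beta_z \left[P_j(z)\left(I(j = k) - P_k(z)\right)\right]$

$\dfrac{\partial P_j(z)}{\partial z_j} = \beta_z \left[P_j(z)(1 - P_j(z))\right]$

$\dfrac{\partial P_j(z)}{\partial z_h} = -\beta_z P_j(z) P_h(z)$ (j ≠ h)

Pj changes monotonically with respect to z

The sign of the derivative depends on the sign of βz

Opposite effect by considering zj or zh

Symmetry: $\dfrac{\partial P_j}{\partial z_h} = \dfrac{\partial P_h}{\partial z_j}$

Pj does not change if the attribute changes by the same amount for all the alternatives (the ranking of Uij is unchanged!)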



Multinomial logit (MNL) vs conditional logit (CNL)

Similar response probabilities, but they differ in some important respects

MNL: the conditioning variables do not change across alternatives

. Characteristics of the alternatives are unimportant or not of interest, or data are not available

. Example: occupational choice: we do not know how much someone could make in every occupation

. We can collect data on factors affecting individual productivity and tastes, e.g. education, past experience

. MNL: factors can have different effects on relative probabilities (different βj for different choices)

CNL: choices on the basis of observable attributes of each alternative

. Common β

MNL as a special case of CNL

Important limitation: independence from irrelevant alternatives assumption



Independence from irrelevant alternatives (logit)

For every pair of alternatives (k, l), the probability ratio (odds) is

$\omega = \dfrac{\Pr(Y_i = k \mid x_{ik})}{\Pr(Y_i = l \mid x_{il})} = \dfrac{\exp(\eta_{ik})}{\exp(\eta_{il})}$

ω depends only on the linear predictors (η) of the considered alternatives, not on the whole set of alternatives

From the point of view of estimation, it is useful that the odds ratio does not depend on the other choices

But it is not a particularly appealing restriction to place on consumer behaviour



IIA: example by McFadden (1984)

Commuters initially choosing between cars and red buses with equal probabilities

Suppose a third mode (blue buses) is added and commuters do not care about the colour of the bus (i.e. will choose between these with equal probability)

IIA implies that the fraction of commuters taking a car would fall from 1/2 to 1/3, a result that is not very realistic: one would rather expect the car share to stay at 1/2, with the two buses splitting the remaining half
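The red bus/blue bus example can be reproduced with a few lines. The sketch assumes equal linear indices for all modes (set to zero here for convenience), which reproduces the equal initial shares; the helper name `logit_shares` is hypothetical.

```python
import math

def logit_shares(etas):
    """Logit choice probabilities for a dict {alternative: linear index}."""
    total = sum(math.exp(e) for e in etas.values())
    return {name: math.exp(e) / total for name, e in etas.items()}

# Equal indices reproduce the equal initial shares of the example.
before = logit_shares({"car": 0.0, "red bus": 0.0})
# Blue buses are valued exactly like red buses.
after = logit_shares({"car": 0.0, "red bus": 0.0, "blue bus": 0.0})

print("before:", before)
print("after: ", after)
print("car / red bus odds before and after:",
      before["car"] / before["red bus"], after["car"] / after["red bus"])
```

The car-to-red-bus odds are the same before and after the new alternative is added, which is exactly the IIA property.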



Testing IIA
Hausman and McFadden (1984)

If a subset of the choice set is truly irrelevant, omitting it from the model altogether will not change the parameter estimates systematically

Exclusion of these choices will be inefficient but will not lead to inconsistency

But if the remaining odds are not truly independent from these alternatives, then the parameter estimates obtained when these choices are included will be inconsistent

Therefore, Hausman’s specification test can be applied



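Hausman’s specification test

Consider two different estimators $\hat{\theta}_E$ and $\hat{\theta}_I$

Under H0, $\hat{\theta}_E$ and $\hat{\theta}_I$ are both consistent and $\hat{\theta}_E$ is efficient relative to $\hat{\theta}_I$

Under H1, $\hat{\theta}_I$ remains consistent while $\hat{\theta}_E$ is inconsistent

Then H0 can be tested by using the Hausman statistic:

$H = (\hat{\theta}_I - \hat{\theta}_E)' [\text{Est.Asy.Var}(\hat{\theta}_I - \hat{\theta}_E)]^{-1} (\hat{\theta}_I - \hat{\theta}_E)$

$= (\hat{\theta}_I - \hat{\theta}_E)' [\text{Est.Asy.Var}(\hat{\theta}_I) - \text{Est.Asy.Var}(\hat{\theta}_E)]^{-1} (\hat{\theta}_I - \hat{\theta}_E) \xrightarrow{d} \chi^2_J$

The appropriate degrees of freedom for the test will depend on the context

In the case of MNL, J is the number of parameters in the estimating equation of the restricted choice set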



What if IIA hypothesis is not satisfied?
(1) Multinomial probit model (MNP)

$U_j = \beta'x_j + \varepsilon_j$, j = 1, ..., J, $[\varepsilon_1, \varepsilon_2, \dots, \varepsilon_J] \sim N(0, \Sigma)$

$\Pr(Y_i = j) = \Pr(U_j > U_k, \forall k \ne j)$, j = 1, 2, ..., J

Main obstacle: difficulty in computing the multivariate normal probability for any dimensionality higher than 2

Recent advances in accurate simulations of multinormal integrals have made estimation of MNP more feasible

. Simulation-based estimation



What if IIA hypothesis is not satisfied?
(2) Generalized extreme value: Nested logit models

Very appealing if it is possible to assume sequential choices

The J alternatives are grouped into L subgroups:

(1) First the group of alternatives is chosen
(2) Then, one alternative is chosen within the group

IIA is maintained within groups, but does not need to hold across groups

Main limitations

. Results can depend on the way in which groups are formed...

. There is no specification test to discriminate among different groupings

Treatment of rankings

Ordered data

Y can assume a limited number of categories yc, c = 0, 1, ..., C

Categories are inherently ordered: y0 < y1 < y2 < ... < yC

Examples:

. Bond rating: AAA-D

. Symptoms: none, minor, serious

. Drug effect: worsen, none, partial recovery, full recovery

. Customer satisfaction: very unsatisfied, unsatisfied, satisfied, very satisfied

. ...

Ordered probit and logit models

. Multinomial models would fail to account for the ordinal nature of the dependent variable

. OLS would attach a meaning to the difference between the category codings



Latent regression

We consider a continuous latent variable y* (unobserved), linear function of x and ε: $y^* = x'\beta + \varepsilon$

We observe $y = c \Leftrightarrow \gamma_c < y^* \le \gamma_{c+1}$, with $\gamma_0 = -\infty$ and $\gamma_{C+1} = +\infty$

The latent response is specified by a linear regression model without the intercept



Ordered Probit Model
$y^* = x'\beta + \varepsilon$ with $\varepsilon \sim N(0, 1)$

$\Pr(y_i = 0 \mid x) = \Pr(y_i^* \le \gamma_1) = \Pr(\varepsilon_i \le \gamma_1 - x'\beta \mid x) = \Phi(\gamma_1 - x'\beta)$

$\Pr(y_i = 1 \mid x) = \Pr(\gamma_1 < y_i^* \le \gamma_2) = \Phi(\gamma_2 - x'\beta) - \Phi(\gamma_1 - x'\beta)$

...

$\Pr(y_i = C \mid x) = \Pr(y_i^* > \gamma_C) = 1 - \Phi(\gamma_C - x'\beta)$

Usually y* has no real meaning

The interest is in Pr(y|x) rather than E(y*|x)

To identify the parameters: x cannot contain the intercept

. If you have to specify a model with an intercept, set γ1 = 0



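Marginal effects

Coefficients are difficult to interpret:

$\dfrac{\partial \Pr(y_i = 0 \mid x)}{\partial x_j} = -\beta_j \phi(\gamma_1 - x'\beta)$

. sign opposite to the sign of βj

$\dfrac{\partial \Pr(y_i = c \mid x)}{\partial x_j} = \beta_j \left[\phi(\gamma_c - x'\beta) - \phi(\gamma_{c+1} - x'\beta)\right]$

. ambiguous sign!!!

$\dfrac{\partial \Pr(y_i = C \mid x)}{\partial x_j} = \beta_j \phi(\gamma_C - x'\beta)$

. same sign as βj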
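The ordered probit probabilities and the sign pattern of the marginal effects can be sketched as follows. Thresholds and index value are hypothetical, and the function name `ordered_probit` is introduced only for this illustration. The derivatives are taken with respect to the index x′β, so they must be multiplied by βj to obtain the marginal effect of xj.

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def ordered_probit(index, cuts):
    """Probabilities and index-derivatives for categories 0..C.

    `cuts` holds the thresholds gamma_1 < ... < gamma_C; `index` is x'beta.
    """
    cdf = [0.0] + [Phi(g - index) for g in cuts] + [1.0]     # Phi at gamma_0 .. gamma_{C+1}
    pdf = [0.0] + [phi(g - index) for g in cuts] + [0.0]     # phi at gamma_0 .. gamma_{C+1}
    probs = [cdf[c + 1] - cdf[c] for c in range(len(cuts) + 1)]
    # d Pr(y = c) / d(x'beta) = phi(gamma_c - x'beta) - phi(gamma_{c+1} - x'beta);
    # multiply by beta_j to get the marginal effect of x_j.
    slopes = [pdf[c] - pdf[c + 1] for c in range(len(cuts) + 1)]
    return probs, slopes

# Hypothetical thresholds and index (illustration only).
probs, slopes = ordered_probit(index=0.3, cuts=[-0.5, 0.4, 1.2])
for c, (p, s) in enumerate(zip(probs, slopes)):
    print(f"y={c}  Pr={p:.4f}  dPr/d(x'beta)={s:+.4f}")
```

The probabilities sum to one, so the derivatives sum to zero: the lowest category always moves against the index, the highest always with it, and the intermediate categories can go either way.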



Changes in y and y* in response to changes in x

Increasing one of the x’s while holding β and γ constant is equivalent to shifting the distribution of y* to the right (solid to dashed curve)



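Ordered Logistic Regression: εi ∼ logistic
Proportional odds model

$\Pr(y_i > c) = \dfrac{\exp(x_i'\beta - \gamma_{c+1})}{1 + \exp(x_i'\beta - \gamma_{c+1})}$

$\log\left(\dfrac{\Pr(y_i > c)}{1 - \Pr(y_i > c)}\right) = x_i'\beta - \gamma_{c+1}$

$\dfrac{\Pr(y_i > c)/[1 - \Pr(y_i > c)]}{\Pr(y_j > c)/[1 - \Pr(y_j > c)]} = \exp[(x_i - x_j)'\beta]$

. Doesn’t depend on the threshold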



Ordered Probit vs. Ordered Logit

Coefficients and threshold parameters are different due to different scale factors ($\sigma^2_{probit} = 1$, whereas $\sigma^2_{logit} = \pi^2/3$)

Predicted probabilities are similar

Marginal effects are similar

If the logit is chosen, estimated coefficients can be interpreted in terms of odds

