

Contents lists available at ScienceDirect

The Journal of Choice Modelling

The Journal of Choice Modelling 19 (2016) 40–53

http://dx.doi.org/10.1016/j.jocm.2016.07.004
1755-5345/© 2016 Elsevier Ltd. All rights reserved.

journal homepage: www.elsevier.com/locate/jocm

Mixed logit with a flexible mixing distribution☆

Kenneth Train
Department of Economics, University of California, Berkeley, United States

Article info

Article history: Received 28 February 2016; Received in revised form 19 July 2016; Accepted 21 July 2016; Available online 3 August 2016

Keywords: Mixed logit; Mixing distribution; Nonparametric

☆ Matlab codes to implement the procedures in this paper are available at http://eml.berkeley.edu/~train/software.html.
E-mail addresses: [email protected], [email protected]

Abstract

This paper presents a flexible procedure for representing the distribution of random parameters in mixed logit models. A logit formula is specified for the mixing distribution, in addition to its use for the choice probabilities. The properties of logit assure positivity and provide the normalizing constant for the mixing distribution. Any mixing distribution can be approximated to any degree of accuracy by this specification. The researcher defines variables to describe the shape of the mixing distribution, using flexible forms such as polynomials, splines, and step functions. The gradient of the log-likelihood is easy to calculate, which facilitates estimation. The procedure is illustrated with data on consumers' choice among video streaming services.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

A mixed logit model (e.g., Revelt and Train, 1998) contains two parts: a logit specification of a person's probability of choosing a given alternative, which depends on parameters that enter the person's utility function; and a specification of the distribution (often called the mixing distribution) of these utility parameters over people. McFadden and Train (2000) have shown that a mixed logit model can, under benign conditions, approximate any choice model to any degree of accuracy. However, this theoretical generality is constrained in practice by the difficulty of specifying and estimating parameter distributions that are sufficiently flexible and yet feasible from a computational perspective. The vast majority of studies have used normal and lognormal distributions, with a few using Johnson's $S_B$, gamma, and triangular distributions. However, these distributions are limiting, and most researchers will probably agree that, whatever parametric distribution the researcher specifies, he/she quickly becomes dissatisfied with its properties.

The current paper introduces a new way of specifying the distribution of random parameters that is relatively simple numerically and yet allows a high degree of flexibility. For a finite parameter space, the probability of each parameter value is given by a logit function with terms that are defined by the researcher to describe the shape of the distribution. The researcher can specify polynomials, splines, steps, and other functions that have been developed for general approximation. The exponential in the logit numerator assures that the probability is positive, and the summation in the denominator assures that the probabilities sum to one. Sampling from the parameter space is facilitated by the fact that the logit probability takes the same form on subsets as on the full set. When used in a mixed logit, the model contains two logits: one for the decision-maker's choice among alternatives, and another for the "selection" of parameters for the decision-maker. The procedure is easy to program and fast computationally.

Procedures for flexible mixing distributions have been previously proposed by Bajari et al. (2007), Fosgerau and Bierlaire



(2007), Train (2008), Fox et al. (2011), Burda et al. (2008) and Fosgerau and Mabit (2013). The current method is an approximate generalization of the first four of these papers. Burda et al. (2008) obtain flexibility through a convolution of a normal kernel with a skewing function. Fosgerau and Mabit (2013) suggest an approach that approximates a different function than the density of the utility parameters. In particular, random terms from a standardized distribution (such as uniform or standard normal) are transformed as they enter utility, and this transformation is approximated by, e.g., a polynomial.

This movement toward greater flexibility, which the current paper augments, shifts the emphasis for future research from overcoming distributional constraints to developing richer datasets that can more clearly distinguish among the variety of shapes that the parameter distribution might take.

Sections 2 and 3 describe the model form and its estimation. Section 4 gives examples of how to specify variables to describe the mixing distribution. Section 5 extends the procedure in two fairly obvious ways. Section 6 provides an application, and Section 7 concludes.

2. A Logit-mixed logit (LML) model

Consider a situation in which each decision-maker makes only one choice, since the double use of logits is most apparent in this situation; generalization to multiple choices by each decision-maker is described in the next section. Let the utility that person n obtains from alternative j in choice set J be denoted in the usual way as $U_{nj} = \beta_n' x_{nj} + \varepsilon_{nj}$, where $x_{nj}$ is a vector of observed attributes, $\beta_n$ is a corresponding vector of utility coefficients that vary randomly over people, and $\varepsilon_{nj}$ is a random term that represents the unobserved component of utility. The unobserved term $\varepsilon_{nj}$ is assumed to be distributed iid extreme value. Under this assumption, the probability that person n chooses alternative i, conditional on $\beta_n$, is the logit formula:

$$Q_{ni}(\beta_n) = \frac{e^{\beta_n' x_{ni}}}{\sum_{j \in J} e^{\beta_n' x_{nj}}} \qquad (1)$$

The researcher does not observe the utility coefficients of each person and knows that the coefficients vary over people. The cumulative distribution function of $\beta_n$ in the population is $F(\beta)$, which is called the mixing distribution. Let F be discrete with finite support set S. This specification is not restrictive, since a continuous distribution can be approximated to any degree of accuracy by a discrete distribution with a sufficiently large and dense S. The probability mass at any $\beta_r \in S$ is expressed as a logit formula:

$$W(\beta_r | \alpha) \equiv \mathrm{Prob}(\beta_n = \beta_r) = \frac{e^{\alpha' z(\beta_r)}}{\sum_{s \in S} e^{\alpha' z(\beta_s)}} \qquad (2)$$

where $z(\beta_r)$ is a vector-valued function of $\beta_r$ and $\alpha$ is a corresponding vector of coefficients. The z variables are chosen to capture the shape of the probability mass function; their specification is discussed in Section 4 below.

The unconditional choice probability is then:

$$\mathrm{Prob}(n \text{ chooses } i) = \sum_{r \in S} W(\beta_r|\alpha) \cdot Q_{ni}(\beta_r) = \sum_{r \in S} \left( \frac{e^{\alpha' z(\beta_r)}}{\sum_{s \in S} e^{\alpha' z(\beta_s)}} \right) \cdot \left( \frac{e^{\beta_r' x_{ni}}}{\sum_{j \in J} e^{\beta_r' x_{nj}}} \right) \qquad (3)$$

The model contains a logit formula for the probability that the decision-maker chooses alternative i and a logit formula for the probability that the decision-maker has utility coefficients $\beta_r$. The researcher's task is to specify the x variables that describe the probability of each alternative and the z variables that describe the probability of each $\beta_r$.

The advantage of using a logit model for the mixing distribution is that it allows for easy and flexible specification of relative probabilities. The researcher specifies z variables that describe the shape of the distribution, without needing to be concerned about assuring positivity or summation to one: the exponential in the logit numerator guarantees that the probability at each point is positive, and the sum in the denominator guarantees that probabilities sum to one over points.
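As a concrete illustration of Eq. (3), the following Python sketch builds both logits on a one-dimensional grid. The grid, z variables, attributes, and parameter values are all illustrative, not from the paper (whose accompanying code is in Matlab):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-2.0, 2.0, 50)            # support set S (illustrative)
z = np.column_stack([grid, grid**2])         # quadratic z-variables
alpha = np.array([0.8, -0.5])                # illustrative coefficients

# Mixing distribution W(beta_r | alpha), eq. (2): a logit over grid points
W = np.exp(z @ alpha)
W /= W.sum()

# Choice logit Q_ni(beta_r), eq. (1), for three alternatives with one
# attribute each
x = rng.normal(size=3)
util = np.outer(grid, x)                     # beta_r * x_j for each r, j
Q = np.exp(util)
Q /= Q.sum(axis=1, keepdims=True)

prob = W @ Q                                 # unconditional probabilities, eq. (3)
```

Because both pieces are logits, `prob` is automatically a proper probability vector: each element is positive and the elements sum to one.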

The specification is entirely general in the sense that any choice model with any mixing distribution can be approximated to any degree of accuracy by a model of the form of Eq. (3). McFadden's (1975) "mother logit" theorem shows that any model that describes the choice among alternatives can be represented by a logit formula of the form in Eq. (1). An analogous derivation applies for representing the mixing distribution as a logit formula.

Result: For any mixing distribution, there exists a sequence of probability distributions in the form of Eq. (2) that converges weakly (i.e., in distribution) to that mixing distribution. Stated more intuitively, any mixing distribution can be approximated to any degree of accuracy by a logit model of the form given in Eq. (2). Proof: For any distribution F*, there exists a sequence of discrete distributions, labeled $F_m, m = 1, 2, \ldots$, that converges weakly to F* (e.g. Chamberlain, 1987).¹ Let the probability mass function associated with distribution $F_m$ be denoted $k_m f_m(\beta)$ at each $\beta \in S_m$, where $k_m$ is the normalizing constant, $f_m(\beta)$ is the kernel, and $S_m$ is the support set. Define $g_m(\beta) = \ln(f_m(\beta))$, which exists for each $\beta \in S_m$; that is, function

¹ Dan McFadden has suggested (personal communication) a simple demonstration: Take m draws from F* and define $F_m(\beta) = \sum_{i=1}^{m} I(\beta_i \leq \beta)/m$. By the strong law of large numbers, this sequence converges weakly to F*.


$g_m$ is the log of the kernel of the probability mass function. Eq. (2) with $\alpha = 1$ and $z(\beta_r) = g_m(\beta_r)$ returns $k_m f_m(\beta_r)$:

$$\frac{e^{\alpha' z(\beta_r)}}{\sum_{s \in S_m} e^{\alpha' z(\beta_s)}} = \frac{f_m(\beta_r)}{\sum_{s \in S_m} f_m(\beta_s)} = \frac{k_m f_m(\beta_r)}{\sum_{s \in S_m} k_m f_m(\beta_s)} = k_m f_m(\beta_r) \qquad (4)$$

If $\ln f_m(\beta)$ cannot be calculated exactly, then it can be approximated by a linear-in-parameters form $\alpha' z(\beta)$ with appropriate selection of z variables, such as step functions, splines or polynomials; see, e.g., Theorems 4.7.1–4.7.3 respectively in Sohrab (2003). This derivation points out that the goal of the researcher is to specify z variables that capture the shape of the log of the density of the random coefficients.²
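The identity in Eq. (4) is easy to verify numerically. In this Python sketch, the bimodal kernel is illustrative; setting z equal to the log of the kernel with $\alpha = 1$ reproduces the normalized kernel exactly:

```python
import numpy as np

grid = np.linspace(-4.0, 4.0, 101)           # support set (illustrative)
# Illustrative bimodal kernel f_m(beta)
f = np.exp(-(grid + 1.5) ** 2) + 0.5 * np.exp(-2.0 * (grid - 1.0) ** 2)

logit_mass = np.exp(np.log(f))               # z = ln f, alpha = 1
logit_mass /= logit_mass.sum()               # logit denominator normalizes

target = f / f.sum()                         # k_m * f_m on the grid
```

After normalization, `logit_mass` and `target` agree to floating-point precision, which is exactly the chain of equalities in Eq. (4).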

3. Estimation

We now allow for multiple choices by each decision-maker. Index choice situations by t and denote the attributes of alternative j for person n in choice situation t as $x_{njt}$. Suppose the person chose alternative $i_t$ in choice situation t, and consider the person's sequence of chosen alternatives $i_1, \ldots, i_T$ in T choice situations. The probability that person n made this sequence of choices, conditional on $\beta_n$, is:

$$L_n(\beta) = \prod_{t=1,\ldots,T} Q_{n i_t t}(\beta) \qquad (5)$$

where $Q_{n i_t t}(\beta)$ is given by Eq. (1) appropriately modified to represent choice situation t. The (unconditional) probability of

the sequence of choices is then:

$$P_n = \sum_{r \in S} L_n(\beta_r)\, W(\beta_r | \alpha) \qquad (6)$$

The parameters to be estimated are the elements of the vector $\alpha$ that describes the mixing distribution. The log-likelihood function for $\alpha$ given a sample indexed by $n = 1, \ldots, N$ is

$$LL = \sum_{n=1,\ldots,N} \ln\left( \sum_{r \in S} L_n(\beta_r)\, W(\beta_r|\alpha) \right) \qquad (7)$$

S might be so large that calculating LL is infeasible. If so, then the log-likelihood function can be simulated in the usual way by using random draws of $\beta_r$ for each person (McFadden, 1989; Hajivassiliou and Ruud, 1994; Train, 2009). Let $S_n \subset S$ be a subset of R randomly selected values of $\beta$, with all elements of S having the same probability of being selected.³ The simulated log-likelihood function is:

$$SLL = \sum_{n} \ln\left( \sum_{r \in S_n} L_n(\beta_r)\, w_n(\beta_r|\alpha) \right) \qquad (8)$$

where $w_n$ is the logit formula based on subset $S_n$:

$$w_n(\beta_r|\alpha) = \frac{e^{\alpha' z(\beta_r)}}{\sum_{s \in S_n} e^{\alpha' z(\beta_s)}} \qquad (9)$$

The estimator is the value of $\alpha$ that maximizes SLL. This simulation-based estimator is consistent, asymptotically normal, and asymptotically equivalent to the non-simulated maximum likelihood estimator if (i) R rises faster than N and (ii) the non-simulated maximum likelihood estimator is itself consistent and asymptotically normal (Gourieroux and Monfort, 1993; Lee, 1995; Train, 2009).⁴,⁵

² As mentioned above, Fosgerau and Mabit (2013) proposed an approach for flexible mixing distributions that is based on an alternative representation of the mixed logit formula. The difference can now be explained. For random u with a standardized density g(u), the utility coefficients are represented by transformation $\beta = T(u|\theta)$, which depends on parameters $\theta$. The mixed logit formula becomes $\int L_{ni}(T(u|\theta))\, g(u)\, du$. The researcher selects g, such as uniform, and a flexible form for T, such as a polynomial, and estimates the parameters $\theta$. The procedure is easy to code and fast computationally. The density of the utility parameters, $f(\beta)$, is implied by the estimated transformation and the specified g. This density is not easy to derive, but it can be readily simulated for any estimated transformation and specific g. However, the relation of T and g to f is complex, and it is often not clear whether, or how, particular forms of T and g restrict the possible shapes of f. Clarifying this relation is a worthwhile goal for future research.

³ The generalization to unequal selection probabilities is discussed in Section 5.2 below.

⁴ The use of a subset of β's in estimation is similar to McFadden's (1978) suggestion to estimate standard (i.e., fixed-coefficient) logit models on a subset of alternatives. However, there is a difference. The chosen alternative is necessarily included in the subset of alternatives, which necessitates a correction term in estimation or, as usually applied, a sampling procedure that satisfies the "uniform conditioning property." In contrast, the decision-maker's actual β (which might be called the "chosen" β) is not observed and need not be included in the subset $S_n$. The conditions for consistency are therefore the same as those for any maximum simulated likelihood estimator.

⁵ Note that $w_n(\beta_r)$ is an overestimate of $W(\beta_r)$ by the ratio of denominators. However, for any $R > 0$, the simulated mean of a statistic $t(\beta)$ using $w_n$ is unbiased for its mean based on W: $E\left[\sum_{r \in S_n} t(\beta_r)\, w_n(\beta_r)\right] = \sum_{r \in S} t(\beta_r)\, W(\beta_r)$. Other statistics, such as the mode, are more difficult to simulate.


An important advantage of using a logit formula for the mixing distribution is that it gives easy-to-calculate gradients for maximization of the log-likelihood:

$$\frac{\partial SLL}{\partial \alpha} = \sum_{n} \sum_{r \in S_n} \left[ \left( h_n(\beta_r|\alpha) - w_n(\beta_r|\alpha) \right) z(\beta_r) \right] \qquad (10)$$

where

$$h_n(\beta_r|\alpha) = \frac{L_n(\beta_r)\, w_n(\beta_r|\alpha)}{\sum_{s \in S_n} L_n(\beta_s)\, w_n(\beta_s|\alpha)} \qquad (11)$$

is the probability mass at $\beta_r$ conditional on person n's sequence of choices. The first term in the gradient is the difference between the conditional and unconditional probability of $\beta_r$. The estimated parameters are those at which this difference becomes uncorrelated with the z variables; stated alternatively, if the z variables are correlated with this difference at any given value of the parameters, then the model can be improved by changing the parameters.⁶

For computation, note the following. First, and very importantly, $L_n(\beta_r)$, the probability of the person's choices conditional on $\beta_r$, does not depend on the parameters $\alpha$ that are being estimated. This term therefore does not change during the optimization process for estimation. Given a set $S_n$ of points for a person, this probability is calculated once at each point in the set, and does not need to be recalculated at each iteration of the optimization procedure. The optimization procedure changes the weights $w_n(\beta_r|\alpha)$ associated with each value of $\beta_r$, but not the probability $L_n(\beta_r)$ of the person's choices at that $\beta_r$. This feature reduces computation time considerably. Second, the parameters $\alpha$ are estimated by a standard logit model with an "alternative" for each $\beta_r$ and with the "dependent variable" for each alternative being the conditional weight associated with that value of $\beta_r$. The dependent variable (i.e., conditional weights) changes with each iteration, unlike a standard logit, but otherwise, the speed of standard logit estimation applies.
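These computational points can be sketched in Python with simulated data; the dimensions, data-generating process, and fixed-step gradient ascent are illustrative rather than the paper's implementation. $L_n(\beta_r)$ is computed once, and only the weights change inside the optimization loop:

```python
import numpy as np

rng = np.random.default_rng(1)
N, R, J, T = 50, 40, 3, 5                     # people, grid points, alternatives, situations
grid = np.linspace(-2.0, 2.0, R)              # S_n (same grid for every n here)
z = np.column_stack([grid, grid**2])          # quadratic z-variables

x = rng.normal(size=(N, T, J))                # one attribute per alternative
chosen = rng.integers(0, J, size=(N, T))      # synthetic observed choices

# Precompute L_n(beta_r): probability of n's choice sequence at each grid
# point. This N x R array never changes during optimization.
util = grid[None, None, :, None] * x[:, :, None, :]        # N x T x R x J
Q = np.exp(util)
Q /= Q.sum(axis=-1, keepdims=True)
Qchosen = np.take_along_axis(Q, chosen[:, :, None, None], axis=-1)[..., 0]
L = Qchosen.prod(axis=1)                                   # N x R

def sll_and_grad(alpha):
    # Weights w_n(beta_r | alpha), eq. (9); subtract max for stability
    v = z @ alpha
    w = np.exp(v - v.max())
    w /= w.sum()
    mix = L * w                               # L_n(beta_r) * w_n(beta_r | alpha)
    P = mix.sum(axis=1)                       # per-person likelihood, eq. (8)
    h = mix / P[:, None]                      # conditional mass h_n, eq. (11)
    grad = ((h - w) @ z).sum(axis=0)          # gradient, eq. (10)
    return np.log(P).sum(), grad

alpha = np.zeros(2)
for _ in range(100):                          # illustrative fixed-step ascent
    sll, g = sll_and_grad(alpha)
    alpha += 0.005 * g
```

In practice one would hand `sll_and_grad` to a quasi-Newton optimizer rather than use a fixed step, but the key point carries over: only the cheap weight computation is repeated across iterations.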

The covariance matrix of the estimator can be calculated as the inverse of the Hessian or the BHHH matrix, or by using both in the sandwich estimator, with the BHHH matrix easily calculated as the sample covariance of the bracketed term in Eq. (10).⁷

Usually, however, the researcher will be interested in statistics that depend on $\alpha$, rather than being concerned about $\alpha$ itself. The delta method can be used for these statistics, or the sampling distribution can be simulated by the bootstrap. The latter is conceptually straightforward and, importantly, avoids the need to derive and code up derivative formulas.

4. Variables for the mixing distribution

The critical issue is: how to specify the z variables that describe the mixing distribution? By representing the mixing distribution as a logit function, which assures positivity and summation to one, the researcher can utilize the numerous procedures that have been developed for approximation of standard functions. The possibilities are most readily described through examples.

4.1. Approximate normal and lognormal

It is doubtful that a researcher would want to use this type of mixed logit model for only normals and lognormals, since these distributions can be readily handled with the usual mixed logit software. However, it is instructive to see how normals and lognormals can be accommodated in the current setup. Also, there can be an advantage to using the logit representation of a normal or lognormal, since it eliminates (through specification of S) the long tails of the normal and lognormal, which can be unrealistic in real-world choice situations.

The normal density for β with mean b and variance V contains a linear combination of variables in its exponential, withthe coefficients depending on b and V and the variables depending on β:

$$f(\beta) = m(V) \exp\left( -(\beta - b)' V^{-1} (\beta - b)/2 \right) = m(V) \exp\left( -(\beta' V^{-1} \beta)/2 + b' V^{-1} \beta - (b' V^{-1} b)/2 \right) \qquad (12)$$

where m(V) is the normalizing constant. The exponentiated term is linear in the elements of $\beta$ and the unique elements of $\beta\beta'$, which means that this density can be represented exactly by the logit of Eq. (3). If the elements of $\beta$ are independent (i.e., V diagonal), then the z variables are each element of $\beta$ and each element squared. For correlation among elements of $\beta$, cross-products of the elements are included as additional z variables.⁸ The logit denominator provides the normalizing

⁶ Note that the average difference $\sum_{r \in S_n} \left( h_n(\beta_r|\alpha) - w_n(\beta_r|\alpha) \right)$ is necessarily zero for each person, since each probability, $h_n(\beta_r|\alpha)$ and $w_n(\beta_r|\alpha)$, sums to one over $\beta_r \in S_n$ such that their difference sums to zero.

⁷ Most optimization codes calculate a numerical Hessian at convergence, which avoids the need to derive and code up the analytic Hessian.

⁸ The number of z variables equals the number of unique parameters in b and V, and the coefficients $\alpha$ are functions of b and V. Estimates of b and V can be calculated from the estimate of $\alpha$. However, as stated above, it is doubtful that the method would be used to represent just a normal distribution; with higher-order polynomials or in combination with other functional forms, translation of the estimated coefficients of the second-order terms to b and V (i.e., to the normal component of the more general density) is not particularly useful or meaningful.


constant. And since the term $b' V^{-1} b/2$ in the normal density does not vary over β's, it drops out of the logit formula. Lognormal distributions are represented analogously with $\ln(\beta)$ replacing $\beta$.
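A numerical sketch of this subsection, in Python with illustrative values: with z variables $(\beta, \beta^2)$ and, following Eq. (12), $\alpha = (b/V,\ -1/(2V))$ for one-dimensional $\beta$, the logit masses coincide exactly with a normal density discretized and normalized over the same grid:

```python
import numpy as np

b, V = 1.0, 0.5                              # mean and variance (illustrative)
grid = np.linspace(-3.0, 5.0, 201)           # finite support set S

# From eq. (12): the exponent is linear in beta and beta^2, with
# alpha = (b/V, -1/(2V)); the constant term drops out of the logit.
z = np.column_stack([grid, grid**2])
alpha = np.array([b / V, -1.0 / (2.0 * V)])
logit_mass = np.exp(z @ alpha)
logit_mass /= logit_mass.sum()

# Direct discretization of the normal density over the same grid
normal_mass = np.exp(-(grid - b) ** 2 / (2.0 * V))
normal_mass /= normal_mass.sum()
```

The two mass functions agree to floating-point precision, and the truncation of the tails to the chosen grid happens automatically.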

4.2. Higher-order polynomials

The example above consists of specifying z to be a second-order polynomial in $\beta$. The polynomial can be extended to higher order to gain greater flexibility. Orthogonal polynomials provide the added advantage of reducing collinearity among the terms. Legendre polynomials are commonly used for numerical approximation of functions in general and are particularly useful in the current setup. Consider one-dimensional $\beta$. The support of Legendre polynomials is $[-1, 1]$, and so the transformation $\tilde\beta = -1 + 2(\beta - a)/(b - a)$ is used, where a and b are the lower and upper bounds of $\beta$ in S. Let $LGP(k, q)$ denote the Legendre polynomial of degree k evaluated at q. The z variables are $z_k(\beta) = LGP(k, \tilde\beta)$ for $k = 1, \ldots, K$, where K is the highest order specified by the researcher. The logit numerator, $\exp(\alpha' z(\beta))$, based on these variables can represent a very wide variety of shapes depending on the value of $\alpha$, and the flexibility rises with more terms. Dependence among the elements of multi-dimensional $\beta$ is captured through cross-products of the terms of each element's polynomial. All cross-products need not be included; e.g., the cross-product of only the first-order terms provides correlation among the elements of $\beta$ without greatly increasing the number of distributional parameters. Chebyshev, Bernstein, or other polynomials can be used instead. Most coding packages, including Matlab, contain commands to calculate the various polynomials.
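A Python sketch of Legendre z variables, with illustrative bounds, order, and coefficients; NumPy's polynomial module plays the role of the Matlab commands mentioned above:

```python
import numpy as np

a_lo, b_hi = -2.0, 4.0                       # bounds of beta in S (illustrative)
grid = np.linspace(a_lo, b_hi, 300)
t = -1.0 + 2.0 * (grid - a_lo) / (b_hi - a_lo)   # transform to [-1, 1]

K = 4                                        # highest order (illustrative)
# z_k(beta) = Legendre polynomial of degree k at the transformed point
Z = np.column_stack([np.polynomial.legendre.Legendre.basis(k)(t)
                     for k in range(1, K + 1)])

alpha = np.array([0.5, -1.0, 0.3, 0.2])      # illustrative coefficients
v = Z @ alpha
mass = np.exp(v - v.max())                   # subtract max for numerical stability
mass /= mass.sum()                           # logit mass over the grid
```

Varying `alpha` traces out the wide variety of shapes described above, with positivity and summation to one guaranteed by construction.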

This approach is very similar to that described by Fosgerau and Bierlaire (2007). The difference is that the current approach inserts the polynomials into a logit function, while Fosgerau and Bierlaire define the probability at a point to be the squared sum of weighted polynomials divided by a normalizing constant that depends on the parameters. The logit function has a more convenient gradient and allows the polynomials to be combined with other z variables.

4.3. Step functions

Let the set S be partitioned into (possibly overlapping) subsets labeled $H_g$ for $g = 1, \ldots, G$. Let $W(\beta)$ be the same for all points within each subset, but different over subsets. The logit formula for the probability masses is:

$$W(\beta_r) = \frac{e^{\sum_g \alpha_g I(\beta_r \in H_g)}}{\sum_{s \in S} e^{\sum_g \alpha_g I(\beta_s \in H_g)}} \qquad (13)$$

The z variables are the G indicators of which subsets contain $\beta_r$. If the subsets do not overlap, then one of the coefficients is normalized to zero. With overlapping subsets, one coefficient is normalized to zero for each possible way of covering the set S.

The use of overlapping sets $H_g$ can reduce the number of parameters with little, if any, loss in interpretation. An example is to use subsets for each single coefficient (to provide a step function for each marginal) and then use more coarsely defined subsets for each pair of coefficients to provide a coarser multi-dimensional step function that captures correlation. Fig. 1 depicts an example in two dimensions. There are six intervals in each dimension, giving 36 partitioned squares. Estimating a parameter for each square would result in 35 parameters (with the summation to 1 providing the 36th share), which can be considered the saturated specification. Consider instead overlapping partitions, with six rectangles differentiating one of the dimensions, as in Fig. 1a; six rectangles differentiating the other dimension, as in Fig. 1b; and four large squares differentiating over both dimensions, as in Fig. 1c. This specification contains 5+5+3=13 parameters and yet provides as flexible a distribution in each dimension as the saturated specification, as well as correlation over the dimensions (through the parameters for the four large squares). The extra 22 parameters in the saturated specification provide extra multi-dimensional flexibility, but might not warrant the large number of extra parameters. The reduction in the number of parameters from using overlapping subsets in this way is greater with more dimensions and more steps in each dimension. Essentially, the researcher's selective use of subsets can allow for finely defined grids without a proliferation of parameters and with meaningful, interpretable relations among the probabilities at all points in S.

Bajari et al. (2007), Train (2008) and Fox et al. (2011) proposed estimation of the probability mass at each point in S, with S specified as a grid. Their model is a type of latent class model, with each point representing a class, but with the location of the points being specified rather than estimated as in a standard latent class model. The LML model generalizes their approach by allowing each $H_g$ to contain more than one point and to overlap. A numerical limitation of their approach is that the number of parameters is the number of grid points, which becomes unwieldy or infeasible, even for moderately coarse grids. For example, a grid with only six points in each of seven dimensions contains 279,936 points, each of which, in their approach, represents a parameter. As well as presenting numerical difficulties for optimization and calculation of standard errors, it is doubtful that a model with so little structure is really needed. In general, the LML model can be seen as a latent class model where: (1) the points are specified instead of estimated and (2) the share at each point need not be treated as an individual parameter but rather can be related to the shares at other points through the z variables. Both of these features allow far more points to be included in LML models than traditional latent class models.
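The overlapping-subsets specification of Fig. 1 can be sketched as follows in Python; the grid, band and quadrant definitions, and parameter values are illustrative:

```python
import numpy as np
from itertools import product

edges = np.linspace(0.0, 1.0, 7)             # 6 intervals per dimension
centers = (edges[:-1] + edges[1:]) / 2.0
points = np.array(list(product(centers, centers)))   # 36 grid points in S

def band(i, dim):
    # Indicator for the i-th interval along the given dimension
    return (points[:, dim] >= edges[i]) & (points[:, dim] < edges[i + 1])

def quadrant(qx, qy):
    # Indicator for one of the four large squares (split at 0.5)
    in_x = points[:, 0] < 0.5 if qx == 0 else points[:, 0] >= 0.5
    in_y = points[:, 1] < 0.5 if qy == 0 else points[:, 1] >= 0.5
    return in_x & in_y

# Drop one indicator per covering for identification: 5 + 5 + 3 = 13
cols = ([band(i, 0) for i in range(5)] +
        [band(i, 1) for i in range(5)] +
        [quadrant(0, 0), quadrant(0, 1), quadrant(1, 0)])
Z = np.column_stack(cols).astype(float)      # 36 x 13 indicator matrix

alpha = 0.1 * np.arange(1, 14)               # illustrative parameters
mass = np.exp(Z @ alpha)
mass /= mass.sum()                           # probability mass per square, eq. (13)
```

Thirteen parameters generate a full probability mass over all 36 squares, mirroring the 5+5+3=13 count in the discussion of Fig. 1.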

4.4. Splines

Linear splines can be written in the form $\alpha' z(\beta)$ as needed for our specification. Consider, as an example, the spline $f(\beta)$ for one-dimensional $\beta$ with starting point $\bar\beta_1$, knots at $\bar\beta_2$ and $\bar\beta_3$, and endpoint $\bar\beta_4$. The distributional parameters are the height of

Fig. 1. Overlapping step functions.

the spline at these points, which are labeled $\alpha_1, \ldots, \alpha_4$. The formula for the spline is

$$f(\beta) = \begin{cases} \alpha_1 + \dfrac{\alpha_2 - \alpha_1}{\bar\beta_2 - \bar\beta_1}\,(\beta - \bar\beta_1) & \text{if } \beta \le \bar\beta_2 \\[4pt] \alpha_2 + \dfrac{\alpha_3 - \alpha_2}{\bar\beta_3 - \bar\beta_2}\,(\beta - \bar\beta_2) & \text{if } \bar\beta_2 < \beta \le \bar\beta_3 \\[4pt] \alpha_3 + \dfrac{\alpha_4 - \alpha_3}{\bar\beta_4 - \bar\beta_3}\,(\beta - \bar\beta_3) & \text{if } \bar\beta_3 < \beta \end{cases} \;=\; \alpha' z(\beta) \qquad (14)$$

where z contains four elements:

$$z_1(\beta) = \left(1 - \frac{\beta - \bar\beta_1}{\bar\beta_2 - \bar\beta_1}\right) I(\beta \le \bar\beta_2)$$

$$z_2(\beta) = \frac{\beta - \bar\beta_1}{\bar\beta_2 - \bar\beta_1}\, I(\beta \le \bar\beta_2) + \left(1 - \frac{\beta - \bar\beta_2}{\bar\beta_3 - \bar\beta_2}\right) I(\bar\beta_2 < \beta \le \bar\beta_3)$$

$$z_3(\beta) = \frac{\beta - \bar\beta_2}{\bar\beta_3 - \bar\beta_2}\, I(\bar\beta_2 < \beta \le \bar\beta_3) + \left(1 - \frac{\beta - \bar\beta_3}{\bar\beta_4 - \bar\beta_3}\right) I(\bar\beta_3 < \beta)$$

$$z_4(\beta) = \frac{\beta - \bar\beta_3}{\bar\beta_4 - \bar\beta_3}\, I(\bar\beta_3 < \beta)$$

and $I(\cdot) = 1$ if the statement in parentheses is true and $= 0$ otherwise.

The discrete probability mass at any point in S is defined by the logit Eq. (3) evaluated at the spline function. The exponentiation changes the shape of the spline, but its flexibility remains. Since the overall height is irrelevant, one of the parameters (heights) is normalized, such that, in the above example, there are three free parameters to be estimated. Multidimensional splines are defined analogously.
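A Python sketch of the four spline z variables in Eq. (14), with illustrative knots and heights; at each knot, $\alpha' z(\beta)$ recovers the corresponding height:

```python
import numpy as np

knots = np.array([0.0, 1.0, 2.5, 4.0])       # beta_bar_1..beta_bar_4 (illustrative)

def spline_z(beta, knots):
    # Hat-function z-variables of eq. (14), for beta within the knot range
    z = np.zeros((len(beta), len(knots)))
    for k in range(len(knots) - 1):
        lo, hi = knots[k], knots[k + 1]
        seg = beta <= hi if k == 0 else (beta > lo) & (beta <= hi)
        u = (beta - lo) / (hi - lo)           # position within the segment
        z[seg, k] += 1.0 - u[seg]
        z[seg, k + 1] += u[seg]
    return z

alpha = np.array([1.0, 3.0, 2.0, 0.5])       # illustrative heights
Z = spline_z(np.linspace(0.0, 4.0, 9), knots)
# At the knots, alpha' z hits the heights [1.0, 3.0, 2.0, 0.5] exactly
at_knots = spline_z(knots, knots) @ alpha
```

Between knots, $\alpha' z(\beta)$ interpolates linearly, so exponentiating it in Eq. (3) yields a mass function whose log is the piecewise-linear spline.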

4.5. Combinations

The z variables described above can be combined to take advantage of the properties of each. One particularly useful combination is to specify a step-function or spline for each single coefficient, and then capture correlation over coefficients with a second-order polynomial. This procedure utilizes fewer parameters for correlations than creating multi-dimensional step-functions or splines. The specification can be viewed as consisting of a joint normal distribution (created by the second-order polynomial) with the marginals re-shaped based on a step function or spline.

4.6. Method of sieves

The use of polynomials, step functions and linear splines within the LML model can be viewed as a type of nonparametric (or semi-nonparametric) estimation of the mixing distribution, by the method of sieves (Chen, 2007). The number of $\alpha$



parameters (i.e., the order of the polynomial and/or the number of steps/nodes) rises with sample size, providing ever-more flexibility in fitting the true distribution. The requirements for consistency and asymptotic normality, and the rates of convergence, are important topics that are beyond the scope of this paper. These issues depend greatly on the statistics that are being evaluated. For example, requirements for consistency are less stringent and the rates of convergence faster when evaluating summary statistics of the mixing distribution (where the rising number of $\alpha$ parameters are treated as nuisance parameters) than when assessing the maximum error over the density's support.

5. Extensions

5.1. WTP space

Train and Weeks (2005) describe how the utility parameters in a mixed logit can be represented as the person's willingness-to-pay rather than the coefficients entering the person's utility. They call the former representation “models in WTP space” and the latter “models in preference space.” The procedure in the current paper for specifying the distribution of utility parameters is described above for models in preference space. However, the procedure is equally applicable to models in WTP space. The only change is that utility entering the logit probability of choice is changed from $\beta_n' x_{nit}$ to $\sigma_n(-r_{nit} + wtp_n' x_{nit})$, where $r_{nit}$ is price, $wtp_n$ is a vector of willingness to pay for each non-price attribute, and $\sigma_n$ is a random scalar. The vector $\beta$ is re-defined as $\langle wtp_n, \sigma_n \rangle$, after which $\beta$ enters the same in all subsequent equations.
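The WTP-space re-parameterization is a one-line change in the utility calculation. The sketch below uses hypothetical attribute and parameter values (not from the paper) to show the transformation:

```python
import numpy as np

# Hypothetical values for one person n in one choice situation (i, t).
r_nit = 9.99                         # price
x_nit = np.array([1.0, 0.0, 1.0])    # non-price attributes
wtp_n = np.array([-2.8, 3.2, 2.5])   # willingness to pay per attribute ($/month)
sigma_n = 0.30                       # random price/scale coefficient

# Preference-space utility beta_n' x_nit is replaced by
# sigma_n * (-r_nit + wtp_n' x_nit):
u = sigma_n * (-r_nit + wtp_n @ x_nit)

# The random vector becomes beta = <wtp_n, sigma_n>, used unchanged thereafter.
beta = np.append(wtp_n, sigma_n)
print(round(u, 3))  # -3.087
```

The distributional machinery (z variables, logit mixing weights) then applies to this re-defined $\beta$ without modification.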

5.2. Unequal probability sampling

The simulated log-likelihood function is defined above for a sample of $\beta$'s selected with equal probability from S. Non-equal probabilities can be used in the random selection of $\beta$'s for each person's simulated probability, using importance sampling. In particular, let $q(\beta_r)$ be a probability mass function over $\beta_r \in S$ that is determined by the researcher. The LL function can be re-written as

$$
\mathrm{LL}(\alpha) = \sum_{n=1,\ldots,N} \ln \left( \sum_{\beta_r \in S} \big( L_n(\beta_r)/q(\beta_r) \big)\, W(\beta_r \mid \alpha)\, q(\beta_r) \right)
\tag{15}
$$
which is the same as (7) but with the sampling probability made explicit. Simulation points in $S_n$ are selected randomly from S with probability $q(\beta_r)$. The simulated log-likelihood then becomes:

$$
\mathrm{SLL}(\alpha) = \sum_{n} \ln \left( \sum_{\beta_r \in S_n} \big( L_n(\beta_r)/q(\beta_r) \big)\, w_n(\beta_r \mid \alpha) \right).
\tag{16}
$$

The term $L_n(\beta_r)/q(\beta_r)$ does not depend on $\alpha$ and is calculated only once, rather than in each iteration of maximization. The use of q for sampling of the $\beta_r$'s allows the researcher to expand the size of S (i.e., use wider ranges in each dimension) while, for example, giving greater importance to those points near the edge of the ranges where polynomials can take extreme and inaccurate shapes.
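A sketch of one person's term in the importance-sampled simulated log-likelihood may clarify the computational point. Everything here is a synthetic stand-in (random $L_n$, $q$, and z values), and computing the mixing weights as a logit over the sampled subset $S_n$ is my reading of Eq. (16), based on the property that the logit takes the same form on subsets as on the full set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ingredients for one person n: R sampled points beta_r, the
# researcher-chosen sampling pmf q, the person's choice probabilities
# L_n(beta_r), and the z variables at each sampled point.
R, K = 2000, 5
q = rng.dirichlet(np.ones(R))        # q(beta_r): unequal sampling probabilities
L_n = rng.uniform(0.01, 1.0, R)      # L_n(beta_r), precomputed
Z = rng.normal(size=(R, K))          # z(beta_r) for each sampled point

ratio = L_n / q                      # L_n/q is free of alpha: compute it once

def sll_contribution(alpha):
    """Person n's term in Eq. (16): ln of the weighted sum over S_n."""
    v = Z @ alpha
    w = np.exp(v - v.max())          # subtract max for numerical stability
    w /= w.sum()                     # logit weights over the sampled subset S_n
    return np.log(ratio @ w)

val = sll_contribution(np.zeros(K))
```

Because `ratio` is computed outside the optimization loop, each iteration over $\alpha$ only re-evaluates the logit weights, which is what makes the procedure fast.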

6. Application

To illustrate the procedure, I utilize data from a conjoint experiment designed and described by Glasgow and Butler (forthcoming) for consumers' choice among video streaming services. Their experiments included the monthly price of the service and the non-price attributes given in Table 1.

Each choice experiment included four alternative video streaming services with specified price and attributes plus a fifth alternative of not subscribing to any video streaming service. Each respondent was presented with 11 choice situations. Butler and Glasgow obtained choices from 300 respondents and implemented an estimation procedure that accounts for protestors (mainly consumers who never chose a service that shared data). For the present use of their data, I do not include the 40 respondents whom they identified as protestors, so that the sample consists of 260 respondents.

6.1. Normal distribution

The first column of Table 2 gives a model in WTP-space where the WTP's are specified to be jointly normal and the price/scale coefficient $\sigma$ is lognormal.9 I first estimated the model by the hierarchical Bayes (HB) procedure in Train (2009, Ch. 12), modified for WTP space as in Train and Weeks (2005) and Scarpa et al. (2008), using my own Matlab codes. I then

9 These estimates are the same as those in Table 10 of Ben-Akiva et al. (2015). The estimated correlation coefficients, not shown here, are reported by Ben-Akiva et al.


Table 1. Non-price attributes.

Attribute: Commercials between content
Levels: Yes (“commercials”); No (baseline category)

Attribute: Speed of content availability
Levels: TV episodes next day, movies in 3 months (“fast content”); TV episodes in 3 months, movies in 6 months (baseline)

Attribute: Catalog
Levels: 5000 movies and 2500 TV episodes (baseline); 10,000 movies and 5000 TV episodes (“more content”); 2000 movies and 13,000 TV episodes (“more TV, fewer movies”)

Attribute: Data-sharing policies
Levels: Information is collected but not shared (baseline); Usage info is shared with third parties (“share usage”); Usage and personal info shared (“share usage and personal”)


used the HB estimates as the starting values for maximum simulated likelihood in Stata using the command “mixlogitwtp”. Estimation in Stata took a little over 4 h on my PC. The simulated log-likelihood rose from −4017.10 with the HB estimates to −3903.47 with the maximum likelihood estimates.

The estimates indicate that people are willing to pay $1.56 per month on average to avoid commercials. Fast availability is valued highly, with an average WTP of $3.94 per month in order to see TV shows and movies soon after their original showing. On average, people do not want to have more TV shows with fewer movies. But consumers are WTP $2.96 on average to obtain twice as much content of both kinds. Consumers are estimated to have a WTP of 62 cents per month to avoid having their usage data shared in aggregate form; however, the hypothesis of zero average WTP cannot be rejected. Consumers are much more concerned about their personal information being shared along with their usage information: the average WTP to avoid such sharing is estimated to be $2.70 per month.

6.2. Polynomials

Consider now the procedure described above for representing the distribution with polynomials. Using the estimated means and standard deviations in the first column of Table 2, the parameter space S was specified to extend two standard deviations above and below the mean of each WTP and from 0 to 2 for the price/scale coefficient.10 Each dimension was divided into 1000 evenly-spaced points, such that S contained a total of $10^{24}$ multi-dimensional grid points (which is a trillion trillions.) A random sample of 2000 points was drawn independently for each person in the sample. The z variables were specified as a sixth-order polynomial on each utility parameter and, to capture correlations, a second-order polynomial on each pair of WTPs. Notice that even though the parameter space contains many points (approximating a continuous space), only 69 parameters are being estimated.11
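One plausible accounting for the 69 parameters, under my reading of the specification (the paper does not itemize the count), is six polynomial terms per dimension plus one cross term per pair of WTPs:

```python
from math import comb

# Hypothetical tally of the 69 alpha parameters: a sixth-order polynomial
# contributes 6 terms (powers 1-6; the constant is absorbed by the logit
# normalization) for each of the 8 dimensions (7 WTPs plus the price/scale
# coefficient), and the second-order polynomial adds one cross term for
# each pair of the 7 WTPs.
own_terms = 8 * 6
cross_terms = comb(7, 2)
print(own_terms + cross_terms)  # 69
```

Either way, the parameter count is fixed by the z specification and is unaffected by how densely S is gridded.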

The parameters were estimated by maximum simulated likelihood, with standard errors calculated by bootstrapping. Estimation was very fast: I wrote the code in Matlab using its parallel processing toolbox and ran on a computer with a Tesla K20 gpu, which operates well with Matlab's gpu processing. Estimation on the original sample (i.e. not bootstrapped) took 18 s, converging in 615 iterations, with starting values of zero. Bootstrapping (i.e., resampling the respondents, taking new draws, recreating the data based on the new draws, and estimating) took 16 min for 20 resamples. Without the parallel processing toolbox, which allows for gpu processing, run times were about ten times longer, which is still sufficiently fast to allow extensive testing and exploration of alternative specifications.12

10 I always estimate a standard logit first and then a mixed logit with all normal coefficients, in order to check for any problems or anomalies in the estimation process and results. The estimates from the model with all normals can be used to assist in specifying the parameter space for the more flexible procedures.

11 The model with a continuous normal distribution (column 1) was estimated with 100 Halton draws per person; 2000 draws for each person is probably more than needed when the continuous distribution is replaced with a discrete grid and the number of parameters raised from 37 to 69.

12 These times cannot be directly compared to the four hour run-time for the model with all normally distributed WTPs because different codes (Stata and Matlab) were used for the two models.


Table 2. Model in WTP space (standard errors in parentheses).

                              Normal      Polynomial   Spline
Means
  Commercials                 −1.562      −2.808       −2.736
                              (0.4214)    (0.5469)     (0.4913)
  Fast availability            3.945       3.155        2.979
                              (0.4767)    (0.4275)     (0.4264)
  More TV, fewer movies       −0.6988     −0.2231      −0.1274
                              (0.4783)    (0.4464)     (0.4323)
  More content                 2.963       2.514        2.820
                              (0.4708)    (0.3716)     (0.2688)
  Share usage only            −0.6224      0.4827       0.3265
                              (0.4040)    (0.4097)     (0.4481)
  Share personal and usage    −2.705      −2.813       −3.263
                              (0.5844)    (0.6153)     (0.5742)
  No service                  −27.26      −34.09       −34.10
                              (2.662)     (2.065)      (2.361)
  Price/scale                  0.2378      0.3006       0.2943
                              (0.0236)    (0.0356)     (0.0269)

Standard deviations
  Commercials                  3.940       3.537        3.305
                              (0.5307)    (0.2771)     (0.2836)
  Fast availability            3.631       4.304        4.348
                              (0.4138)    (0.3643)     (0.4117)
  More TV, fewer movies        4.857       4.728        4.727
                              (0.5541)    (0.4719)     (0.4943)
  More content                 2.524       2.684        2.816
                              (0.4434)    (0.1397)     (0.2172)
  Share usage only             2.494       2.025        1.905
                              (0.4164)    (0.2404)     (0.2508)
  Share personal and usage     6.751       6.452        6.764
                              (0.7166)    (0.5171)     (0.4950)
  No service                   19.42       23.08        24.06
                              (2.333)     (1.518)      (1.392)
  Price/scale                  0.2539      0.4276       0.4166
                              (0.0391)    (0.0593)     (0.0433)


The second column of Table 2 gives the mean and standard deviation of each utility parameter under this specification. The left side of Fig. 2 graphs the estimated marginal distribution for each utility parameter. Note that the means and standard deviations are fairly similar to those obtained with a normal distribution, but the shape of the distribution for each utility coefficient is quite different from a normal. Several coefficients are bimodal. For example, most people do not care much about whether their personal and usage information is shared, but a group of respondents are willing to pay a considerable amount to avoid this form of sharing. Interestingly, some people are estimated to like having their personal and usage data shared, presumably because they value the promotional information that is targeted to them based on their


Fig. 2. Distribution of utility parameters.


information. Sharing usage information alone (without personal information) is valued positively by an even greater share of the population, since this form of sharing provides benefits in the form of recommended content, with little loss of confidentiality.

The following pairs of attributes have estimated correlations (not shown) that are statistically significant:

• WTP for sharing usage information and for sharing personal and usage information (0.443): people who like having their usage information shared also tend to like having their personal and usage information shared, and similarly for people who dislike sharing.

• WTP for fast content and for more TV with fewer movies (0.410), indicating that people who want content to be available quickly also want to have more TV shows. Stated more intuitively: people who like TV shows more than movies also want the TV shows to be available soon after they are originally shown.

• WTP for commercials and for more content (0.372): people who want more content also don't mind seeing commercials as much as other people.

• WTP for sharing personal and usage information and no service (−0.285): people who dislike having their personal and usage information shared are also more willing to do without the video service altogether.

In the model with normals, the first and second of these correlations are significant, and the other two are not.

The model attained a SLL at convergence of −3864.85, compared to −3903.47 for the model with a normal distribution. In the latter, the normal distribution was estimated with unbounded support, which means that the two models are not precisely nested. For direct comparison, an LML model was estimated with a second-order polynomial instead of sixth-order. The SLL was −3912.86, giving a likelihood ratio test statistic of 96.02. The critical value of chi-squared with 32 degrees of freedom (the number of additional parameters in the more general model) is 47.40 at the 95% confidence level. The hypothesis that the extra parameters are zero can be rejected.
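The likelihood ratio statistic quoted above follows directly from the two simulated log-likelihood values:

```python
# Likelihood ratio statistic comparing the sixth-order polynomial LML model
# to the restricted second-order polynomial LML model:
sll_general = -3864.85
sll_restricted = -3912.86
lr = 2 * (sll_general - sll_restricted)
print(round(lr, 2))  # 96.02
```

Comparing 96.02 to the reported 95% critical value of 47.40 for the additional parameters leads to rejection of the restricted model.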




6.3. Splines

Using the same specification of S, a model was estimated using the combination described above, namely, a six-segment spline for each utility parameter and a second-order polynomial in the WTPs. The model contains 83 parameters: 6 for each dimension's spline, giving 48 in total, plus 35 for the polynomial in the 7 WTPs. Using the gpu, estimation took about 50 s on the original sample, with convergence in 1669 iterations, and the bootstrapping took 37 min.

The estimated means and standard deviations are given in the third column of Table 2, and the marginal distributions are shown on the right side of Fig. 2. The means and standard deviations are fairly similar to those for the other two models. The shapes of the distributions are similar to those obtained with polynomials, and very different from normal.

The correlations that were significant in the polynomial model are also significant in the spline model and have similar magnitudes (0.476 v. 0.443; 0.448 v. 0.410; 0.486 v. 0.372; −0.228 v. −0.285). One additional correlation is significant in the spline model, namely, WTP for fast content and for sharing personal and usage information (−0.356): people who want fast content also dislike having their personal and usage data shared.

7. Discussion

The procedure allows for fast and easy specification of flexible mixing distributions. However, the flexibility places additional burden on the researcher to specify the range of each utility parameter and the z variables that describe the shape of the distribution. The researcher's data might not be sufficiently informative to provide authoritative guidance in making these specification decisions. Different distributions can provide very similar log-likelihood values, which leaves the researcher with the unsatisfying option, given current data, of choosing one distribution over another based on tiny differences in log-likelihood.




The range of values that is allowed for each coefficient or WTP is also an issue. Parameters might be expected, by rational choice, to be non-negative, or non-positive, for all people, and yet a model that specifies a distribution with the theoretically expected support for these parameters might (and in my experience, often does) fit worse than a model that allows a share of values with the “wrong” sign. In the current application, for example, WTP for commercials, fast content, and more content can be theoretically signed. The spline model was re-estimated with the specified end-point for the range of WTPs being zero (at the high end for commercials and the low end for fast content and more content), and the log-likelihood dropped from −3858.74 to −3886.70. This reduction is not evidence that the procedure for estimating the mixing distribution is problematic. Rather, the problem arises because the data do not contain enough information to rule out theoretically implausible behavior.

These issues point to the need for richer data that can more meaningfully distinguish among distributions and better align empirical results with theoretical expectations or requirements. The availability of procedures that accommodate flexible distributions shifts the focus for future research away from concern about overcoming specification constraints to developing forms of data that more clearly reveal the distribution of consumers' preferences.

In regard to the LML model itself, four interrelated issues seem to me to be the most relevant topics for further investigation and improvement. (1) As stated above, delineating the relation of LML models to nonparametric estimation would greatly enhance our understanding of how best to specify and interpret the models. (2) Non-equal probability sampling of points from S has the potential to improve the fit of the mixing distribution and increase the range of the coefficients in S. This potential can be explored through applications and Monte Carlo exercises. (3) It would be useful to determine the extent to which summary statistics for the estimated mixing distribution, such as the mean and standard deviation, depend on the range of coefficients that is used to define S, and how the range can be specified to obtain more accurate and robust estimates of the relevant summary statistics. (4) The tails of distributions are often vitally important for policy and marketing purposes, e.g., what share of households are willing to pay more than some specified amount for an improvement in an attribute. Flexible methods, such as LML, allow the tails to be determined by the data rather than by




assumption, but increase the requirement for data that are in the tails (see, for example, Borjesson et al., 2012). Topic (2) is related to (3) and (4), since unequal probability sampling from S expands the possibilities for feasible specification of S and identification of the tails.

References

Bajari, P., Fox, J., Ryan, S., 2007. Linear regression estimation of discrete choice models with nonparametric distributions of random coefficients. Am. Econ. Rev. 97 (2), 459–463.

Ben-Akiva, M., McFadden, D., Train, K., 2015. Foundations of stated preference elicitation. Working paper, Department of Economics, University of California, Berkeley. http://eml.berkeley.edu/train/foundations.pdf.

Borjesson, M., Fosgerau, M., Algers, S., 2012. Catching the tail: empirical identification of the distribution of the value of travel time. Transp. Res. Part A: Policy Pract. 46 (2), 378–391.

Burda, M., Harding, M., Hausman, J., 2008. A Bayesian mixed logit-probit model for multinomial choice. J. Econ. 147 (2), 232–246.

Chamberlain, G., 1987. Asymptotic efficiency in estimation with conditional moment restrictions. J. Econ. 34, 305–334.

Chen, X., 2007. Large sample sieve estimation of semi-nonparametric models. In: Heckman, J., Leamer, E. (Eds.), Handbook of Econometrics, vol. 6A. North-Holland, New York (Chapter 76).

Fosgerau, M., Bierlaire, M., 2007. A practical test for the choice of mixing distribution in discrete choice models. Transp. Res. Part B: Methodol. 41 (7), 784–794.

Fosgerau, M., Mabit, S., 2013. Easy and flexible mixing distributions. Econ. Lett. 120, 206–210.

Fox, J., Kim, K., Ryan, S., Bajari, P., 2011. A simple estimator for the distribution of random coefficients. Quant. Econ. 2, 381–418.

Glasgow, G., Butler, S., 2016. The value of non-personally identifiable information to consumers of online services: evidence from a discrete choice experiment. Appl. Econ. Lett. (forthcoming).

Gourieroux, C., Monfort, A., 1993. Simulation-based inference: a survey with special reference to panel data models. J. Econ. 59, 5–33.

Hajivassiliou, V., Ruud, P., 1994. Classical estimation methods for LDV models using simulation. In: Engle, R., McFadden, D. (Eds.), Handbook of Econometrics. North-Holland, New York, pp. 2383–2441.

Lee, L., 1995. Asymptotic bias in simulated maximum likelihood estimation of discrete choice models. Econ. Theory 11, 437–483.

McFadden, D., 1975. On independence, structure and simultaneity in transportation demand analysis. Working Paper No. 7511, Institute of Transportation and Traffic Engineering, University of California, Berkeley.

McFadden, D., 1978. Modelling the choice of residential location. In: Karlqvist, A., Lundqvist, L., Snickars, F., Weibull, J. (Eds.), Spatial Interaction Theory and Planning Models. North-Holland, New York, pp. 105–142.

McFadden, D., 1989. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica 57, 995–1026.

McFadden, D., Train, K., 2000. Mixed MNL models of discrete response. J. Appl. Econ. 15, 447–470.

Revelt, D., Train, K., 1998. Mixed logit with repeated choices. Rev. Econ. Stat. 80, 647–657.

Scarpa, R., Thiene, M., Train, K., 2008. Utility in willingness to pay space: a tool to address confounding random scale effects in destination choice to the Alps. Am. J. Agric. Econ. 90 (4), 994–1010.

Sohrab, H., 2003. Basic Real Analysis. Birkhäuser, Boston; second ed., 2014, Springer, Dordrecht.

Train, K., 2008. EM algorithms for nonparametric estimation of mixing distributions. J. Choice Model. 1, 40–69.

Train, K., 2009. Discrete Choice Methods with Simulation, 2nd ed. Cambridge University Press, New York.

Train, K., Weeks, M., 2005. Discrete choice models in preference space and willingness-to-pay space. In: Scarpa, R., Alberini, A. (Eds.), Applications of Simulation Methods in Environmental and Resource Economics. Springer, Dordrecht, pp. 1–17.