
Random Choice and Learning

Paulo Natenzon∗

Washington University in Saint Louis

December 2012

Abstract

We study a decision maker who gradually learns the utility of available alternatives. This learning process is interrupted when a choice must be made. From the observer’s perspective, this formulation yields a random choice model. We propose and analyze one such learning process, the Bayesian Probit Process. We apply our model to the similarity puzzle: Tversky’s similarity hypothesis asserts that if two options, x and y, have roughly equal utility and z is similar to x, then making z available will reduce the probability of choosing x more than it reduces the probability of choosing y. However, there is also evidence that introducing a similar but inferior z may hurt y more than x and can even increase the probability of choosing x (attraction effect). We provide a definition of similarity based only on the random choice process and show that z is more like x than y if and only if the correlation between z and x’s signals is larger than the correlation between z and y’s signals. Our main result establishes that when z and x are correlated and sufficiently close in utility, introducing z hurts y more than x and may even increase the probability of choosing x early in the learning process, but eventually hurts x more than y. Hence, the attraction effect arises when the decision maker is relatively uninformed, while Tversky’s hypothesis holds when she becomes sufficiently familiar with the options. We then show that if z is similar to x and sufficiently inferior, the attraction effect never disappears.

∗E-mail: [email protected]. This paper is based on the first chapter of my doctoral dissertation atPrinceton University. I wish to thank my advisor, Faruk Gul, for his continuous guidance and dedication. Iam grateful to Wolfgang Pesendorfer for many discussions in the process of bringing this project to fruition.I also benefited from numerous conversations with Dilip Abreu, Roland Benabou, Meir Dan-Cohen, DanielGottlieb, Justinas Pelenis, and seminar participants at Alicante, Arizona State, Berkeley, D-TEA Workshop,IESE, IMPA, Johns Hopkins, Kansas University, Haifa, Hebrew University, NYU, Princeton, the RUDConference at the Colegio Carlo Alberto, SBE Meetings, Toronto, UBC, Washington University in St. Louisand Yale. All remaining errors are my own.


1 Introduction

The similarity puzzle is best introduced with an example. Let’s consider a consumer who is uncertain about her preference over two comparable computers: a Mac and a PC. She hasn’t shopped for computers in a few years, she doesn’t follow technology blogs, and she is not sure which one will turn out to be better for her office. If pressed to make a decision, she is equally likely to choose either one:

    P(she chooses the Mac) / P(she chooses the PC) = 1

Suppose we add a third computer to the menu, and the new option is similar to one of the existing alternatives. To fix ideas, suppose the original PC option is a Dell, and we introduce a similar PC made by Toshiba. What should happen to the ratio above? If the Toshiba is chosen with any positive probability when the three options are available, it will necessarily ‘hurt’ the probability that the Mac is chosen, or the probability that the Dell is chosen, or both. But which existing option should the Toshiba hurt proportionally more?

The similarity puzzle arises from disparate answers to this question. On the one hand, there is a large body of evidence and theoretical work, culminating in Tversky’s similarity hypothesis, based on the idea that the ratio should increase once the Toshiba is introduced. But there is also a large body of empirical evidence showing that in many situations the ratio decreases, the most prominent example being the attraction effect. Let’s now look at each side of this puzzle in turn.

Tversky (1972b) proposes the following similarity hypothesis:

    The addition of an alternative to an offered set ‘hurts’ alternatives that are similar to the added alternative more than those that are dissimilar to it.

According to this hypothesis, the consumer substitutes more intensely among similar options, so the Toshiba should hurt the Dell proportionally more. Tversky’s similarity hypothesis summarizes a large body of work from the 1960’s and the 1970’s which tried to understand how similarity affects choices. The similarity hypothesis is one of the underlying principles in most of the modern tools used in discrete choice estimation (see McFadden (2001) and Train (2009)).


While it may be tempting to treat Tversky’s similarity hypothesis as a general principle, it is incompatible with a large amount of empirical evidence that originated in the marketing literature, the most prominent example being the attraction effect. The attraction effect refers to a situation in which the probability of an existing alternative being chosen increases with the introduction of a similar but inferior alternative.[1]

In our consumer example, suppose the Toshiba computer, while being almost identical to the Dell, is clearly worse in minor ways, such as lower processing power, a slightly smaller screen, and a slightly higher price. In other words, while the Dell and the Toshiba share many features, the Toshiba is clearly a dominated option. It would not be surprising if almost no one chose the Toshiba in this situation. What is perhaps surprising is that in many experimental settings the presence of the dominated option increases the probability that the dominant option is chosen. In the example, adding a dominated Toshiba helps boost the sales of the Dell to the detriment of the Mac.

The attraction effect stands in stark contrast to Tversky’s similarity hypothesis. In the attraction effect, the introduction of the Toshiba hurts the Mac proportionally more. Moreover, while the probability of choosing the Mac is reduced, the probability of choosing the Dell actually increases. This constitutes a violation of monotonicity, a very basic property for models of discrete choice. Monotonicity (also known as regularity) states that the probability of existing alternatives cannot decrease when an option is removed from the choice set. Every random utility model (including the multinomial logit, the probit, and the family of generalized extreme value models) is monotone, so virtually all models currently used in discrete choice estimation are incompatible with the attraction effect.

In this paper, we propose a model that allows us to understand and resolve the similarity puzzle. The model explicitly incorporates two frictions in the decision making process. First, the decision maker gradually learns the utility of the available alternatives. A parameter t ≥ 0 captures the amount of information available to the decision maker at the time a choice is made. Second, for any given amount of information the decision maker has about the alternatives (for example, if the decision maker is allowed to contemplate the alternatives for a fixed amount of time), similarity determines how certain the decision maker is about the relative ranking of each pair of alternatives. In the model, similarity is captured by a parameter γij for each pair of alternatives (i, j).

[1] Huber et al. (1982) and Huber and Puto (1983) provide the first experimental evidence for the attraction effect. Since then, it has been confirmed in many different settings, including choice over political candidates, risky alternatives, investment decisions, medical decisions and job candidate evaluation. For these and other examples, see Ok et al. (2012) and the references therein.

Our main results show how these two new dimensions, similarity and information, interact to explain the similarity puzzle. While Tversky’s similarity hypothesis holds for more informed choices (for a sufficiently large level of information), the opposite happens, and may even lead to the attraction effect, for less informed choices (for a sufficiently small level of information).

In our model, stochastic choice arises from the decision maker’s incomplete information about the choice alternatives. The decision maker lacks, ex-ante, any information about the alternatives that would favor a choice of one alternative over another. When contemplating the menu of choices, the decision maker gradually learns the utility of the alternatives. This learning process is interrupted when a choice must be made. From the point of view of the observer, this yields a random choice model.

A random choice rule describes frequencies of choice in each menu for a fixed level of information precision. By indexing a family of random choice rules according to information precision, we model observable choice behavior as a random choice process. The decision maker can ultimately be described by a rational preference relation represented by a utility function µ. Observed choices increasingly reflect this rational preference as the decision maker’s information about the alternatives becomes more precise.

Formally, the model can be described as follows. Let A be the universe of choice alternatives. The decision maker has a standard rational preference relation on A. This preference is represented by the numeric utility µi of each alternative i ∈ A. In addition, each pair of alternatives i, j is described by a correlation parameter γij. The decision maker knows γ but does not know µ. Her ex-ante beliefs about the values of µi are iid Gaussian. When facing a choice from menu b, the decision maker first observes a noisy signal about the true utilities, and updates her beliefs about µ using Bayes’ rule. The signal correlation is given by γ. The probability of choosing option i in menu b is equal to the probability that alternative i has the highest posterior mean belief among the alternatives in b. We call the resulting random choice rule the Bayesian probit rule. We allow the precision of the signals received to vary continuously from pure noise to arbitrarily precise signals. The precision of the signals is captured by the parameter t ≥ 0. A family of Bayesian probit rules ordered according to signal precision t is a Bayesian probit process.
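
The mechanics of the rule can be sketched numerically. The simulation below is only an illustrative sketch, not the paper’s formal construction: it assumes a standard Gaussian prior for each µi, a signal X = µ + ε with noise covariance Σ/t (so Σ collects the pairwise correlations γij and t is the precision), and conjugate normal updating; the function name and exact parameterization are our own.

```python
import numpy as np

def bayesian_probit_choice_probs(mu, Sigma, t, n_sims=100_000, seed=0):
    """Monte Carlo sketch of a Bayesian probit rule.

    mu    : true utilities (unknown to the decision maker)
    Sigma : signal correlation matrix (collects the pairwise gammas)
    t     : signal precision; t = 0 is pure noise, large t is near-perfect information
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    n = len(mu)
    if t == 0:
        # No information: posterior means all equal the prior mean, so tie-break uniformly.
        return np.full(n, 1.0 / n)
    # Noisy signal about the true utilities: X = mu + eps, eps ~ N(0, Sigma / t)
    X = mu + rng.multivariate_normal(np.zeros(n), Sigma / t, size=n_sims)
    # Conjugate updating with prior mu_i ~ iid N(0, 1):
    # posterior mean m = (I + t Sigma^{-1})^{-1} (t Sigma^{-1}) X
    P = t * np.linalg.inv(Sigma)             # signal precision matrix
    A = np.linalg.solve(np.eye(n) + P, P)    # shrinkage toward the prior mean 0
    m = X @ A.T
    # Choice probability of i = probability that i has the highest posterior mean.
    return np.bincount(m.argmax(axis=1), minlength=n) / n_sims

# Three equally desirable options; the signals of options 0 and 2 are highly correlated.
Sigma = np.array([[1.0, 0.0, 0.95],
                  [0.0, 1.0, 0.0],
                  [0.95, 0.0, 1.0]])
p_early = bayesian_probit_choice_probs([0, 0, 0], Sigma, t=0.01)
p_late = bayesian_probit_choice_probs([0, 0, 0], Sigma, t=1000.0)
```

With imprecise signals the dissimilar option 1 is rarely chosen, while with precise signals its choice probability approaches one half; this is the pattern discussed later around Corollary 4.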

The key component of the model proposed in this paper is a new behavioral definition of similarity. This new notion of similarity is subjective and identified (revealed) from choices, much as utility is identified in classical revealed preference theory. The theory allows us to identify, from choice data, one ranking of alternatives according to preference, and one ranking of pairs of alternatives according to similarity.

The behavioral notion of similarity that we propose in this paper abstracts from (and is orthogonal to) utility. Our definition is best motivated with the following geometric example. Suppose the universe of choice alternatives is the set of all triangles in the Euclidean plane. Suppose further that your tastes over triangles are very simple: you always prefer triangles that have larger areas. So your preferences over triangles can be represented by the utility function that assigns to each triangle the numerical value of its area. Among the options shown in Figure 1, which triangle would you choose?

Figure 1 illustrates that it may be hard to discriminate among comparable options. Since the triangles in Figure 1 have comparable areas, they are close in your preference ranking, and you may have a hard time picking the best (largest) one. If you were forced to make your choice with little time to examine the options, you would have a good chance of making a mistake.

If the difference in area were more pronounced, i.e., if one triangle were much larger than the other, you probably wouldn’t have any trouble discriminating among the options. Likewise, a consumer presented with two products, of which one is much cheaper and of much better quality, should seldom make a mistake. Hence, any probabilistic model of choice should include the gap in desirability (measured as the difference in utility) as an important explanatory variable for the ability of subjects to discriminate among the options.

But the gap in utility alone cannot tell the whole story. Even if the utility of a pair of objects is kept exactly the same, it is possible to increase your ability to discriminate among a pair of options by changing some of their other features.


For example, suppose you still prefer larger triangles, and consider the pair shown in Figure 2. The triangle shown on the left in Figure 2 has exactly the same area as the triangle on the left in Figure 1, while the triangle on the right is the same in both Figures. Hence, from the point of view of desirability, the pair shown in Figure 1 is identical to the pair shown in Figure 2. However, note that you are less likely to make a mistake when choosing among the options in Figure 2 (make your pick).

You probably chose the triangle on the left. You probably also found this choice task much easier than the task in Figure 1. Why? It certainly has nothing to do with the gap in the desirability of the two options: as we pointed out, the pairs in both Figures exhibit exactly the same difference in area. So what happened?

When we substituted triangle i′ for triangle i, we kept desirability (the area) the same, but we made every other feature of triangle i′ more closely resemble triangle j. What changed from the pair in Figure 1 to the pair in Figure 2 is that we kept constant the features you care about (the area), while increasing the overlap of other features. And this increased overlap clearly helped improve your ability to discriminate among the options. In the theory that we propose in this paper, we build on this intuition and call the pair (i′, j) of Figure 2 more similar than the pair (i, j) of Figure 1. Similarity is measured independently from utility, and is another important factor determining the ability of a decision maker to discriminate among the choice options.

The pair of triangles in Figure 2 is in fact similar according to the mathematical definition for geometric figures. In Euclidean geometry, two triangles are similar when they have the same internal angles. The concept of geometric similarity abstracts from size, and only refers to the shape of the objects being compared. In the same spirit, the notion of similarity that we propose in this paper abstracts from utility. But instead of defining the similarity of two alternatives based on any of their observable attributes, we infer the level of similarity perceived by the decision maker from her choices. Similarity in our theory is a behavioral notion: it is subjective and inferred from choice data.

To understand the formal definition of similarity, suppose all choice data is obtained under a fixed level of information precision t and consider two pairs of alternatives {i, j} and {k, ℓ}. Under the assumptions of the model, the utility of the alternatives can be inferred from pairwise choices: in any given pair, and for any positive level of information, the more desirable alternative is chosen more often. When each object in a pair is chosen exactly 50 percent of the time, for every level of information precision, we infer that they are equally desirable.

Without loss of generality, suppose that i is better than j. Suppose moreover that k is at least as good as i, and that j is at least as good as ℓ. In other words, the gap in desirability is wider in the pair {k, ℓ}. Without any difference in similarity across the pairs, the gap in desirability alone should allow the decision maker to discriminate between k and ℓ at least as well as between i and j. In other words, based solely on the gap in utility in each pair, the better option k should be chosen from {k, ℓ} at least as often as the better option i is chosen from {i, j}. But it is exactly when k is not chosen from {k, ℓ} at least as often as i is chosen from {i, j} that the pair {i, j} is revealed to have a larger degree of similarity.[2]

A second visual example helps illustrate this point.[3] Suppose the universe of choice alternatives is the set of all star-shaped figures on the Euclidean plane. Your tastes over star-shaped figures are again very simple. This time, you only care about the number of points on a star. For example, when the options are a six-pointed star and a five-pointed star, you always strictly prefer the six-pointed star. With just a glance at the options in Figure 3, which star would you pick?

Again, if pressed to make a choice in a short amount of time, you may be likely to make a mistake. The probability of making a mistake would certainly decrease if you were given more time. You would also be more likely to make the correct choice if one of the alternatives had a much larger number of points than the other. Given a limited amount of time, mistakes are likely, because the options are comparable and therefore difficult to discriminate.

Now consider the options in Figure 4. If you still prefer stars that have more points, which of the two options would you choose? Here, similarity comes to the rescue. The star on the left is the same in both Figures and the star on the right has the same number of points in both Figures. Hence the gap in desirability is the same in both Figures. But even though the difference in utility is the same, there is a much greater overlap of features in the pair shown in Figure 4. Hence it is much easier to discriminate among the options in Figure 4. According to our definition of similarity, we infer from this pattern of choices that the pair in Figure 4 is more similar than the pair in Figure 3.

[2] The attentive reader will notice that pairwise choices may only partially reveal the similarity ranking. As we show later, choices over menus of three options are sufficient to fully identify similarity.

[3] I am grateful to David K. Levine for suggesting this example.

Figure 1: Which triangle has the largest area? (triangles i and j)
Figure 2: Which triangle has the largest area? (triangles i′ and j)
Figure 3: Which star has more points?
Figure 4: Which star has more points?

To see how our model allows us to reconcile Tversky’s similarity hypothesis with the attraction effect, suppose the decision maker faces a choice between two alternatives i and j. Let k be a third alternative with the same utility as i, and let the pair {i, k} be more similar than the pairs {i, j} and {j, k}. In Theorem 3, we show that introducing k reduces the probability of choosing j more than it reduces the probability of choosing i, and may even increase the probability of choosing i, for sufficiently low levels of information precision (i.e., at the beginning of the random choice process). Eventually, as information becomes more precise, the opposite occurs: the introduction of k hurts the similar i more than it hurts the dissimilar j.

To better isolate the role of similarity, suppose that all three alternatives i, j and k have the same utility. In Corollary 4 we show that, when the signal correlation between i and k is taken to be arbitrarily close to one, both effects obtained above become more extreme. For times close to zero, i and k are each chosen with probability close to one half, while j is chosen with probability close to zero. As time goes to infinity, or as information becomes arbitrarily precise, i and k are each chosen with probability close to 1/4, and j is chosen with probability close to 1/2. Hence when all three alternatives are equally desirable, similarity generates a form of attraction effect early in the process, and that effect tends to disappear as the decision maker learns the utilities of the alternatives with more precision. As the correlation between i and k goes to one, these alternatives are eventually treated as duplicates in the random choice process.

In Theorem 5 we tackle the attraction effect. We show that, if alternative k is sufficiently inferior and correlated with i, the attraction effect never disappears: starting arbitrarily early in the random choice process, introducing k increases the probability of choosing i to a value strictly larger than one half (violating monotonicity), and this remains true even as time goes to infinity.

1.1 Related Literature

The attraction effect and other related decoy effects are examples of choice behavior that systematically depends on the context in which the decisions are made.


Context dependent choice challenges the principle of utility maximization in individual decision making. In particular, violations of the weak axiom of revealed preference break the identification of preferences from choices in standard revealed preference theory.

To accommodate context dependence, recent advances in decision theoretic models have relaxed the weak axiom while maintaining the assumption of deterministic choice behavior. In Manzini and Mariotti (2007), a decision maker may exhibit context dependent behavior when the choice procedure consists of applying two or more different rationales sequentially. In Masatlioglu et al. (2012), context-dependence arises from limited attention. For example, a consumer who prefers the Mac may instead choose the Dell when the Toshiba is introduced, because the presence of a third alternative induces her to fail to pay attention to the Mac. In a similar vein, Ok et al. (2012) study context-dependent behavior that arises when one of the alternatives in the menu serves as a reference point, again limiting the attention of the decision maker to a subset of the choices.

By maintaining the assumption of deterministic behavior while relaxing the weak axiom of revealed preference, this literature pushes the scope of deterministic models to their limit. But it is possible to go further. As pointed out by Gul et al. (2012), to understand context dependent behavior, there are great advantages in modeling choice as a stochastic phenomenon. The argument is simple: while it may be reasonable to expect a decision maker to choose alternative i from the set {i, j} on one occasion and to choose j from the set {i, j, k} on another occasion, it is unlikely that we would observe these choices on every occasion. Therefore, models which rely on deterministic violations of the weak axiom are at best as likely to find empirical support as the standard model. In this sense, a probabilistic model of choice may be more appropriate for measuring an individual’s tendency to choose i over j. In this paper, modeling choices as stochastic is natural, because the similarity puzzle is formulated in probabilistic terms. All nuanced effects of similarity on choice would disappear once we formulate the problem deterministically.

Random choice models have a long history, having originated in psychophysics. Until the mid twentieth century theoretical developments were limited to models of binary choice. In his seminal work on individual choice behavior as a probabilistic phenomenon, Luce (1959) introduces the first theoretical restriction for choice across menus with more than two alternatives, in the form of his random choice hypothesis:

    The ratio between the probability with which option j is chosen from a set of options to the probability with which k is chosen from the same set is constant across all sets that contain j and k.

This hypothesis leads to the Luce model of random choice. McFadden (1974) constructs a random utility for the Luce model, which gives rise to the widespread use of conditional logit analysis in discrete choice estimation.

The random choice hypothesis imposes a strong restriction on the patterns of substitutability between alternatives. Debreu (1960) provides an early example of this limitation of the Luce rule, later referred to as the “duplicates problem” in the discrete choice estimation literature. A classic version of Debreu’s example is the red bus/blue bus paradox: suppose that the decision maker is indifferent between taking a train and a blue bus, so that each option is chosen with probability one half. She is also indifferent between taking the blue bus or an identical bus that happens to be red, so she also chooses each bus with probability one half. Intuitively, since the decision maker doesn’t care about the color of the bus, the train should still be chosen with probability one half when both buses are available. However, in the Luce model this pattern of choices implies that the train is chosen with probability 1/3 when the three options are available.
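
The arithmetic behind Debreu’s example is easy to reproduce. The sketch below is a minimal Luce rule (choice probabilities proportional to fixed positive weights); the equal weights are forced by the two binary indifferences in the example, but the dictionary representation and names are our own illustration.

```python
def luce(weights, menu):
    """Luce rule: an option's choice probability is its weight divided by
    the total weight of the options in the menu."""
    total = sum(weights[i] for i in menu)
    return {i: weights[i] / total for i in menu}

# Both binary indifferences force equal weights on all three options.
w = {"train": 1.0, "blue_bus": 1.0, "red_bus": 1.0}
pair = luce(w, ["train", "blue_bus"])             # train chosen with probability 1/2
trio = luce(w, ["train", "blue_bus", "red_bus"])  # train drops to 1/3, not the intuitive 1/2
```

The intuitive answer would keep the train at one half and split the remaining half between the two buses, but no assignment of Luce weights can deliver both binary indifferences and that pattern.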

Debreu’s example illustrates an extreme case in which two options are treated like a single option by the decision maker. But even when options are not exactly treated as duplicates, the decision maker may exhibit a pattern of substitutability that is incompatible with the Luce rule. It is in this vein that Tversky (1972b) proposes the similarity hypothesis:

    The addition of an alternative to an offered set ‘hurts’ alternatives that are similar to the added alternative more than those that are dissimilar to it.

Many random choice models allow for substitutability patterns in line with Tversky’s similarity hypothesis. Commonly used random utility models in discrete choice estimation, such as the Generalized Extreme Value family and the multinomial probit, accommodate complex substitution patterns by allowing correlation in the error terms of random utility (see for example Train (2009)). In a more theoretical vein, Tversky’s (1972a) elimination-by-aspects model and, more recently, Gul et al.’s (2012) attribute rule are two other examples.

All the random choice models we mentioned above are random utility maximizers, and therefore they all share the property of monotonicity. A random choice rule satisfies monotonicity if the probability of an existing alternative being chosen can only decrease when a new alternative is added. Thus, the attraction effect presents a challenge to random utility models. Any probabilistic model of individual decision making that accommodates behavior compatible with the attraction effect must therefore depart from random utility. The model proposed in this paper departs from random utility maximization and allows violations of monotonicity.

The majority of the recent decision theoretic developments that accommodate behavior compatible with the attraction effect posit some form of bounded rationality or limited attention on the part of the decision maker. In contrast, our model describes a decision maker who is fully aware of every option in her opportunity set, but only learns the utility of the available alternatives gradually. In our model, context dependence arises when similarity allows the decision maker to learn the relative ranking of some pairs of alternatives faster than others.

The paper is organized as follows. In the next Section we introduce the model and provide a baseline characterization. In Section 3 we characterize the Bayesian probit process, define our notion of similarity and show that it is captured by signal correlation. Section 4 contains our main results and addresses the similarity puzzle. In Section 5 we compare the Bayesian probit process to existing random utility models in the literature. In Section 6 we offer an application of the model to a setting where noise correlation depends on observable attributes of the choice alternatives. We conclude in Section 7. All proofs are in the Appendix.

2 Learning Process

Let A be the grand set of choice objects and let A denote the collection of all

finite subsets of A. We write A+ = A \ {∅} for the non-empty finite subsets.

Random choice rules are a generalization of deterministic choice functions that
explicitly treat choice behavior as probabilistic.


2.1 Random Choice Rules

Definition. A function ρ : A × A+ → [0, 1] is a random choice rule if for all
b ∈ A+ we have ρ(b, b) = 1 and ρ(a, b) = ∑_{i∈a} ρ(i, b).

To simplify the statements below, we identify each i ∈ A with the singleton
{i}. The value ρ(i, b) specifies the probability of choosing i ∈ A, given that the

selection must be made from the choice set b ∈ A+. The first equation in the

definition of a random choice rule is a feasibility constraint: ρ must choose among

the options available in b. The second requirement makes ρ(·, b) a probability over

b. A random choice rule ρ is monotone⁴ if for every a ⊂ b ∈ A+ and i ∈ a we

have ρ(i, a) ≥ ρ(i, b). For monotone random choice rules, the probability of

choosing an alternative i can only decrease when new alternatives are added to a

menu. The three examples below illustrate classic random choice rules on which

applications have relied in the literature. All of them are monotone.

Example 1 (Rational Choice). The decision maker has a complete and transitive
preference relation ≽ ⊂ A × A. Her random choice rule is specified as ρ(i, b) = 1
whenever i ∈ b is the unique best alternative according to ≽. If b has exactly
two alternatives which are best according to ≽, they are chosen with the same
probability. When b has more than two best alternatives according to ≽, we
only require each of them to be chosen with strictly positive probability, every
other alternative to be chosen with zero probability, and the resulting rule to be
monotone.⁵ ♦

Example 2 (Random Utility). Most econometric models of discrete choice such

as logit and probit are special examples of random utility maximizers. To fix

ideas, suppose the grand set of alternatives A is finite. Let U be a collection

of strict utility functions on A. A random utility π is a probability distribution

on U . A random choice rule maximizes a random utility π if the probability of

choosing alternative i from menu b equals the probability that π assigns to utility

functions that attain their maximum in b at i. ♦

⁴ Monotone random choice rules are sometimes called regular in the literature.
⁵ Alternatively, we could have imposed uniform tie-breaks for all sets with multiple best alternatives. The resulting rule would automatically satisfy monotonicity. This alternative definition turns out to be too restrictive for the purposes of the theory developed later on.


Example 3 (Probit Rule). The probit rule is a random utility maximizer with
Gaussian errors. Suppose the set A has n alternatives. The probit rule specifies
a random vector X : Ω → Rⁿ with a joint normal distribution, X ∼ N(µ, Σ). In
each menu a ⊂ A, alternative i ∈ a is chosen with probability
ρ(i, a) = P(Xi > Xj, ∀j ∈ a, j ≠ i). ♦
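As a concrete illustration (a sketch, not from the paper; the function name is ours), the probit choice probabilities of Example 3 can be estimated by simulating the random utility vector:

```python
import numpy as np

def probit_choice_probs(mu, Sigma, n_draws=200_000, seed=0):
    """Estimate rho(i, a) for a probit rule by Monte Carlo: draw
    X ~ N(mu, Sigma) and count how often each coordinate is the largest."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(mu, Sigma, size=n_draws)
    winners = X.argmax(axis=1)
    return np.bincount(winners, minlength=len(mu)) / n_draws

# Three equally good alternatives with independent errors: by symmetry,
# each should be chosen with probability close to 1/3.
p = probit_choice_probs([0.0, 0.0, 0.0], np.eye(3))
```

Correlating the errors through Σ is what later allows the model to encode similarity between alternatives.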

We will sometimes restrict our attention to binary choices. A binary random

choice rule is obtained by restricting the domain of a random choice rule to sets

of two alternatives. To simplify the expression for choice probabilities in binary

menus we will write

ρ(i, j) := ρ(i, {i, j}).

Definition. A utility index µ : A → R and a function over pairs γ : A × A → [−1, 1] are utility signal parameters on A if

(i) For each i ∈ A, γ(i, i) = 1;

(ii) For each i, j ∈ A, γ(i, j) = γ(j, i);

(iii) For each enumeration of a finite menu b = {1, 2, . . . , n} ⊂ A, the n × n
symmetric matrix whose entry in row i and column j is given by γ(i, j) is
positive definite.

To simplify notation, we write µi = µ(i) and γij = γ(i, j).

Before introducing our model, it is useful to understand how utility signal

parameters (µ, γ) can represent the probit rule of Example 3. Let (µ, γ) be

utility signal parameters. For each enumeration of a menu b = {1, 2, . . . , n}, let

the random utility for the alternatives in b be given by the n-dimensional random

vector X, where X is joint normally distributed and has mean (µ1, µ2, . . . , µn)
and covariance matrix (γij). Then by construction ρ(i, b) = P(Xi > Xj, ∀j ≠ i)
defines a probit rule.

The classic interpretation in econometric work for a random utility maximizer

ρ is that it describes the frequencies of choices of a population of deterministic

utility maximizers. The randomness arises from the modeler’s imperfect knowl-

edge of the tastes of each individual. Even though at the individual level the

choice is deterministic, due to unobservable characteristics the econometrician is

only able to predict the choices of the individual in probabilistic terms.


We can alternatively interpret random utility maximizers as a model of one-

shot learning. When confronted with a set of alternatives, the decision maker

receives a precise signal of how those alternatives are ranked according to her

preferences. While tastes are stochastic, each choice realization is interpreted as

perfectly reflecting the ranking of alternatives at the moment in which the choice

was made.

Our model of random choice departs from random utility maximization and is

based on a gradual learning interpretation. Ultimately preferences are described

by a rational, complete and transitive ranking of all alternatives. But ex-ante

the decision maker has no information about how the alternatives in a particular

menu b are ranked. Instead, the decision maker has prior beliefs about the distri-

bution of her preferences in the same kind of choice situation. These prior beliefs

are symmetric, i.e., all alternatives are perceived in exactly the same way ex-ante.

Before making a choice from the menu b, the individual has the opportunity to

contemplate the alternatives. This contemplation takes the form of a signal that

conveys partial information about the true utility of the alternatives. From the

point of view of the observer, this formulation yields the following random choice

rule.

2.2 Bayesian Probit Rule

The model can be described as the following modification of the probit rule. The

decision maker is described by utility signal parameters (µ, γ) and additionally by

prior parameters (m0, s0²) where m0 ∈ R and s0 > 0. The parameter µ describes

the true utility of each alternative and is unknown to the decision maker. The

decision maker has a prior belief about µ that is jointly Gaussian, where the

utility of each alternative is seen as an independent draw from the same Gaussian

distribution with mean m0 and variance s0². When facing a choice from a menu

b = {1, 2, . . . , n}, the individual chooses as if observing a draw from the random

utility vector X, updating beliefs about the utility of each alternative according

to Bayes’ rule, and picking the alternative with the highest posterior mean. We

refer to the resulting random choice rule as a Bayesian probit rule with utility

signal parameters (µ, γ) and prior parameters (m0, s0²).
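The updating step can be sketched numerically. The formula below is standard Gaussian conjugate updating under the signal structure X(t) ∼ N(µt, tΣ) made precise in Subsection 2.4; the function names are ours, and the text describes the rule only verbally:

```python
import numpy as np

def posterior_mean(x, Sigma, t, m0=0.0, s0=1.0):
    """Posterior mean of the utility vector mu after observing the signal
    X(t) = x, where X(t) ~ N(mu * t, t * Sigma) and the prior on each
    coordinate of mu is i.i.d. N(m0, s0**2)."""
    n = len(x)
    Sigma_inv = np.linalg.inv(Sigma)
    precision = np.eye(n) / s0**2 + t * Sigma_inv
    rhs = m0 * np.ones(n) / s0**2 + Sigma_inv @ x
    return np.linalg.solve(precision, rhs)

def bayesian_probit_choice(x, Sigma, t, m0=0.0, s0=1.0):
    """The decision maker picks the alternative with the highest posterior mean."""
    return int(np.argmax(posterior_mean(x, Sigma, t, m0, s0)))

# With a long horizon and a (deliberately noiseless) signal x = mu * t,
# the posterior mean approaches the true utility vector:
mu_true = np.array([1.0, 0.0, -1.0])
m_late = posterior_mean(mu_true * 1000.0, np.eye(3), t=1000.0)
```

At small t the posterior mean shrinks toward the prior mean m0, so choices are close to uniform; at large t it tracks the signal, so choices approach utility maximization.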

How much the decision maker learns about the alternatives before making a

choice will depend on the precision of the signals. A natural question to ask in


this model is how random choice behavior changes when the information about

utility is made more precise. For example, one could do comparative statics

analysis of random choice behavior when the decision maker is allowed to draw

an additional independent signal from the same distribution. Choices arising from

more precise information should exhibit fewer mistakes. More informed behavior

can be interpreted as arising, for example, from a smarter decision maker, from

a decision maker contemplating the alternatives at a closer distance, or from a

decision maker who contemplated the alternatives for a longer period of time

before making a decision.

Our model describes random choice behavior as the precision of the signal for

a random choice rule varies continuously from pure noise to arbitrarily precise

signals. To allow such comparisons, we study an entire family of random choice

rules, indexed by an ordered set. We call the resulting object a random choice

process.

2.3 Random Choice Process

To investigate the effects of increasing the precision of the utility signal over ran-

dom choice behavior, we now introduce a family of random choice rules indexed

by an ordered set T. For concreteness, we can think of T = [0,∞) as time.

Definition. A function ρ : A×A+×T→ [0, 1] is a random choice process if for

every time t ∈ T the restriction ρ(·, ·, t) is a random choice rule.

A random choice process describes the probability distribution of choices for

each finite set of alternatives b ∈ A+ and each point in time t ∈ T. Time can

be seen merely as an abstract ordering of the different choice rules that compose

the random choice process, but it can also be interpreted in the more literal

sense of a temporal dimension. For example, a random choice process can be

thought of as encoding observable behavior in experiments of individual decision

making where both the set of alternatives and the time available to contemplate

the alternatives can be manipulated.

With the literal interpretation of T as time, it is important to note that we

do not treat the entire path of each realization of a choice process as observ-

able. While this stronger enrichment of choice data would be sufficient, and


is considered in theoretical work at least since Campbell (1978), and in recent

experimental work in Caplin and Dean (2010), it is not necessary here.

Henceforward, we write ρt(i, a) instead of ρ(i, a, t). We use ρt to refer to the

random choice rule obtained from the random choice process ρ by restricting its

domain to a fixed time t. A random choice process ρ is continuous if for every

i ∈ a ∈ A+ the mapping t 7→ ρt(i, a) is continuous. We are interested in the

random choice process of a decision maker whose choices are based on better

information as time increases, approaching rationality as time goes to infinity.

We say that a random choice process ρ ultimately learns if there is a random

choice rule ρ∞ satisfying the conditions of rational choice in Example 1, such

that for each i ∈ a ∈ A+ we have ρt(i, a)→ ρ∞(i, a) as t→∞. A random choice

process ρ starts uninformed when for all menus b ∈ A+ and all alternatives

i, j ∈ b, we have ρ0(i, b) = ρ0(j, b). In particular, for a binary random choice

process we have ρ0(i, j) = ρ0(j, i) for all i, j.

Definition. A random choice process ρ is a learning process if it is continuous,

starts uninformed, ultimately learns, and for all pairs of alternatives i, j ∈ A the

mapping t 7→ ρt(i, j) is either strictly increasing, strictly decreasing, or constant.

2.4 Bayesian Probit Process

This subsection completes the description of our model. The Bayesian probit

process is a random choice process ρ in which every component random choice

rule ρt is a Bayesian probit rule with the same utility signal parameters (µ, γ).

As t increases, ρt is based on more precise information. To introduce the formal

description, consider for a moment the case of discrete time t ∈ {1, 2, 3, . . .}. In

this case ρ1 is just the Bayesian probit rule described in Subsection 2.2 above.

Given ρt, we let ρt+1 correspond to the random choice rule obtained from ρt

as if the decision maker could observe an additional independent utility signal

realization from the same distribution. The continuous time case is analogous to

the discrete case when we let the time interval go to zero.

Utility signals as a diffusion process

The decision maker’s prior beliefs about the utility of each alternative are iden-

tically and independently distributed Gaussian with mean m0 ∈ R and variance

s0² > 0. This prior reflects her information about the alternatives before starting

the learning process. Ex-ante, all alternatives look the same, reflecting the fact

that no signal has been observed. When the grand set of choice objects A has

enough variety, we can think of the prior as the actual distribution of the util-

ity parameter when the decision maker faces choice problems with alternatives

from A. With this interpretation, the decision maker has correct beliefs, given

the information available, and assuming that each finite menu is composed of an

independent random sample of objects from A.

Let 1, 2, . . . , n be an enumeration of the alternatives in a menu b ∈ A+. We

model choices from b at each point in time as being governed by a covert utility

signal process. The signal is interpreted as representing the decision maker’s

noisy perception of the desirability of each alternative over time. We model the

signal process X as a Brownian motion with drift, given by

dX(t) = µ dt+ Λ dW (t), X(0) = 0 (1)

where X(t) is an n-dimensional vector, µ ∈ Rn is a constant drift vector, Λ is a

constant n× n matrix with full rank and W (t) = (W1(t), . . . ,Wn(t)) is a Wiener

process.

The signal starts at X(0) = 0 almost surely. For each fixed time t > 0,

the vector X(t) has a joint Gaussian distribution with mean µt and covariance

matrix tΛΛ′. Each coordinate of X is a one-dimensional signal that corresponds

uniquely to one of the alternatives in the menu. When we refer to the signal X

for menu b ∈ A+, we have in mind an enumeration of the alternatives in b that

corresponds to the ordering of the coordinates in X.

We will assume that the signal processes that the decision maker observes

when facing different menus are consistent: the drift and covariance parameters
are always taken from the corresponding utility signal parameters (µ, γ) on A. For

example, when one of the alternatives is deleted from menu b, the corresponding

coordinate of X is removed, but otherwise the signal process remains the same.
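Equation (1) can be simulated directly; since the coefficients are constant, the Euler scheme below is exact in distribution at each grid point (an illustrative sketch with our own names):

```python
import numpy as np

def simulate_signal_paths(mu, Lambda, T=1.0, n_steps=100, n_paths=5000, seed=0):
    """Euler scheme for equation (1): dX = mu dt + Lambda dW, X(0) = 0.
    With constant coefficients the scheme is exact in distribution, so
    X(T) ~ N(mu * T, T * Lambda Lambda').  Returns terminal values X(T)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    Lambda = np.asarray(Lambda, dtype=float)
    dt = T / n_steps
    X = np.zeros((n_paths, len(mu)))
    for _ in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=X.shape)
        X += mu * dt + dW @ Lambda.T
    return X

# Two alternatives whose signal noise has correlation 1/2:
Lam = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))
X_T = simulate_signal_paths([1.0, -1.0], Lam, T=1.0)
```

The sample mean of X(T) across paths recovers µT and the sample covariance recovers TΛΛ′, matching the distribution stated above.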

2.5 Baseline characterization

We start by characterizing a baseline Bayesian probit process in which choices

evolve as if pairs of alternatives did not vary in similarity.


We now introduce some additional notation. When ρ is a random choice rule,

the number ρ(i, a) is a probability value between zero and one. To simplify the

statements below, it will sometimes be convenient to express this quantity in the

scale [−∞,∞] by using the standard Gaussian cumulative distribution function

Φ. Given a random choice rule ρ, we define ρ̄ : A × A+ → [−∞, ∞] by

ρ̄(i, a) :=  −∞,            if ρ(i, a) = 0,
            Φ⁻¹(ρ(i, a)),   if ρ(i, a) ∈ (0, 1),
            +∞,            if ρ(i, a) = 1.

Definition. The random choice rule ρt satisfies the Gaussian triangle condition

when, for all i, j, k ∈ A,

ρ̄t(i, k) = ρ̄t(i, j) + ρ̄t(j, k). (3)

A random choice process ρ satisfies the Gaussian triangle condition when it is

satisfied by every ρt.

The Gaussian triangle condition is a cross-restriction on how the random

choice rule is able to discriminate among pairs of alternatives. It says that if we

know how well the random choice rule discriminates among the pairs (i, j) and

(j, k), then we should also know how it discriminates between i and k. Note that

the condition is expressed using the Gaussian transformation ρ̄ defined in (2).

This condition arises naturally in a model with Gaussian noise. Analogous con-

ditions would hold for other distributions, e.g., the logit model.

Definition. A random choice process ρ has a Gaussian learning rate when for

all alternatives i, j, k, l ∈ A and all times s, t ∈ T

ρ̄t(i, j) × ρ̄s(k, l) = ρ̄s(i, j) × ρ̄t(k, l). (4)

To understand the substance of this last property, suppose for a moment that

t > s and note that when the denominators are different from zero we can rewrite
condition (4) as

ρ̄t(i, j) / ρ̄s(i, j) = ρ̄t(k, l) / ρ̄s(k, l).

The left hand side ratio is a measure of how the discrimination between alterna-

tives i and j evolved from time s to time t. The Gaussian learning rate condition

requires that this ratio be constant across pairs of alternatives: if we know how


the discrimination between i and j changed from s to t, and we know how alter-

natives k and l were discriminated at time s, then we must know how those two

alternatives are discriminated at t.
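Both conditions can be checked numerically in a toy example. The closed form ρt(i, j) = Φ((µi − µj)√(t/2)) used below is an illustrative assumption for this sketch (the binary probit probability with uncorrelated unit-variance signals), not a formula stated in this section:

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf
Phi_inv = NormalDist().inv_cdf

def rho(mu_i, mu_j, t):
    """Binary choice probability under the baseline model with uncorrelated
    unit-variance signals (illustrative closed form)."""
    return Phi((mu_i - mu_j) * sqrt(t / 2))

def rho_bar(mu_i, mu_j, t):
    """The Gaussian transformation (2): rho_bar = Phi^{-1}(rho)."""
    return Phi_inv(rho(mu_i, mu_j, t))

mu = {1: 1.0, 2: 0.4, 3: -0.5}
t, s = 4.0, 1.0

# Gaussian triangle condition (3): rho_bar_t(1,3) = rho_bar_t(1,2) + rho_bar_t(2,3).
lhs_triangle = rho_bar(mu[1], mu[3], t)
rhs_triangle = rho_bar(mu[1], mu[2], t) + rho_bar(mu[2], mu[3], t)

# Gaussian learning rate (4): cross products of rho_bar at times t and s agree.
lhs_rate = rho_bar(mu[1], mu[2], t) * rho_bar(mu[2], mu[3], s)
rhs_rate = rho_bar(mu[1], mu[2], s) * rho_bar(mu[2], mu[3], t)
```

Here ρ̄t(i, j) = (µi − µj)√(t/2), so (3) is additivity of utility differences and (4) holds because ρ̄t/ρ̄s = √(t/s) for every pair.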

Finally, we introduce a simple notion of time change. A function f : T → T
is a time change when it is strictly increasing and continuous. We say that ρ is
a Bayesian learning process up to a time change when there is a time change f
such that ρ′t = ρf⁻¹(t) is a Bayesian learning process. Of course, every Bayesian
learning process ρ is a Bayesian learning process up to a time change, by taking
f(t) = t.

Theorem 1 (Baseline Model). A binary random choice process ρ is a simple

Bayesian probit process (up to a time change) if and only if it is a learning process,

has a Gaussian learning rate and satisfies the Gaussian triangle condition.

3 Revealed Similarity

In classical revealed preference theory, choice data takes the form of a determin-

istic choice function. A deterministic choice function picks one or more elements

from a given menu of alternatives with probability one. The Weak Axiom of Re-

vealed Preference (WARP) imposes consistency on choice data across different

menus of alternatives. If an element j is picked from menu a when k was also

available, then WARP does not allow k to be picked in any other menu that con-

tains j, unless j is picked as well. The theory allows us to tease out a preference

relation from choice data that satisfies WARP.

Our choice data takes the form of a random choice process ρ. Since the choice

data is stochastic, a random choice process finely captures the tendency of a

decision maker to pick an alternative j over an alternative k. Our theory goes

beyond classical revealed preference to explain comparability: how easy is it for

a decision maker to choose the best alternative out of the pair j, k?

We define comparability as how easy it is to compare two options. Compa-

rability depends on preference. Consider a choice from the pair j, k where j is

actually preferred to k. Comparability is directly proportional to how likely the

decision maker is to choose j. In other words, the less likely the decision maker

is to make a mistake and pick k, the easier it is to compare the pair j, k.

Let ρ be a random choice rule. We define a preference relation ≽ on A from


the binary random choices in ρ in the following manner. We let i ≽ j, and say
that alternative i is at least as good as j, when

ρ(i, j) ≥ 1/2 ≥ ρ(j, i).

Write ∼ for the symmetric part of ≽ as usual. When ρ satisfies the assumptions
of Theorem 1, ≽ is complete and transitive and represented by µ. Hence under
those assumptions the random choice rule reveals the preference ranking ≽.

Analogously, from a given ρ we can define a similarity relation ⊵ over pairs
of alternatives on A × A. A pair of alternatives i, j ∈ A is more similar than a
pair i′, j′ ∈ A, denoted (i, j) ⊵ (i′, j′), when i ∼ i′, j ∼ j′ and

ρ(i, j)ρ(j, i) ≤ ρ(i′, j′)ρ(j′, i′) (5)

and we say that i is more similar to j than to j′ when we take i = i′ above. We
write ▷ for the strict part of ⊵.

To interpret this notion of similarity, note that i and i′ are equally ranked

according to preference, and also j and j′ are equally ranked according to pref-

erence. Insofar as preference is concerned, the alternatives in the pair (i, j) are

compared exactly as the alternatives in the pair (i′, j′). Expression (5) says that

the random choice rule discriminates the pair (i, j) at least as well as the pair

(i′, j′). A strict inequality in (5) therefore indicates that (i, j) must be easier to

compare.
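A small numerical sketch of this discrimination measure, using the illustrative binary closed form ρ(i, j) = Φ((µi − µj)√t / √(2(1 − γij))) (an assumption for this example, not a formula from this section):

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def rho_pair(delta, gamma, t=1.0):
    """Probability of choosing i over j when mu_i - mu_j = delta and the
    signals have correlation gamma (illustrative binary closed form)."""
    return Phi(delta * sqrt(t) / sqrt(2 * (1 - gamma)))

def discrimination_product(delta, gamma, t=1.0):
    """rho(i, j) * rho(j, i): by (5), smaller products mean the pair is
    more similar, i.e. easier to compare."""
    return rho_pair(delta, gamma, t) * rho_pair(-delta, gamma, t)

# Same utility difference, different signal correlation:
p_similar = discrimination_product(delta=0.5, gamma=0.8)
p_dissimilar = discrimination_product(delta=0.5, gamma=0.0)
```

With the utility difference held fixed, the more correlated pair is discriminated better, which is exactly the sense in which signal correlation captures similarity.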

For a random choice process ρ satisfying the assumptions of Theorem 1, there

is never strict inequality in (5). In this sense, Theorem 1 characterizes a baseline

case in which all pairs of alternatives are equally similar. The characterization

is useful because it allows us to identify the Gaussian triangle condition as the

property responsible for all alternatives being equally similar.

To understand the role of the Gaussian triangle condition, let i ∼ i′ and j′ ∼ j.
This implies that ρ(i, i′) = ρ(j, j′) = 1/2 and therefore ρ̄(i, i′) = ρ̄(j, j′) = 0. By
the Gaussian triangle condition we have

ρ̄(i, j) = ρ̄(i, i′) + ρ̄(i′, j) = ρ̄(i, i′) + ρ̄(i′, j′) + ρ̄(j′, j) = ρ̄(i′, j′)

and therefore ρ(i, j) = ρ(i′, j′), so that the inequality in (5) is never strict.

Hence, in order to study the consequences of varying degrees of similarity

for random choice behavior, we need to relax the Gaussian triangle condition.

In the next subsection, we will require the Gaussian triangle condition to fail.


Moreover, in order to identify the similarity relation, we will require that the

grand set of alternatives A have a pivotal alternative and enough variety in the

way that the Gaussian triangle condition is violated by triples that include the

pivotal alternative.

3.1 A distinct alternative

In order to facilitate the introduction of the next assumption, we will restate the

Gaussian triangle condition in a more convenient form. Recall that the Gaussian
triangle condition requires that for all triples of alternatives i, j, k ∈ A,

ρ̄(i, k) = ρ̄(i, j) + ρ̄(j, k).

Therefore, when ρ(i, k) ≠ 1/2, so that ρ̄(i, k) ≠ 0, we can rewrite this condition
as a ratio

(ρ̄(i, j) + ρ̄(j, k)) / ρ̄(i, k) = 1

or equivalently in the following convenient form

1 − ( (ρ̄(i, j) + ρ̄(j, k)) / ρ̄(i, k) )² = 0.

Definition. A random choice rule ρ has a distinct alternative k∗ when, for every
pair i, j ∈ A with i ≁ j and every ε > 0, there exists a pair of alternatives
j′, j′′ ∈ A such that j′ ∼ j ∼ j′′, ρ(j, k∗) = ρ(j′, k∗) = ρ(j′′, k∗) and

1 − ( (ρ̄(i, k∗) + ρ̄(k∗, j′′)) / ρ̄(i, j′′) )² < ε − 1 < 1 − ε < 1 − ( (ρ̄(i, k∗) + ρ̄(k∗, j′)) / ρ̄(i, j′) )².

To understand the role of the distinct alternative k∗ in the condition above,

note that it requires that for every pair of alternatives i, j in which i is not equally

preferred to j, we can find two alternatives j′, j′′ equally preferred to j, such that

the triples i, j′, k∗ and i, j′′, k∗ sufficiently violate the Gaussian triangle condition

in opposite directions. So this condition embeds both an assumption about the

existence of a distinct or pivotal alternative k∗, and also a condition about a rich

variety of options in the set A. Substituting the distinct alternative assumption

for the Gaussian triangle condition, one can obtain a Bayesian Probit Process in

which γ is not orthogonal.

Suppose M is a property of a random choice rule. We say that a random

choice process ρ eventually has property M when there is a time T ∈ T such that

for all times t > T the random choice rule ρt has property M .


Proposition 2 (Ranking stabilization). For every menu b ∈ A+ and every pair
of alternatives i ≻ j in b, the Bayesian learning process eventually chooses i more
often than j in b.

4 Similarity Puzzle

In order to state the main results, throughout this section we assume that ρ is

a Bayesian Probit Process with a distinct alternative. We use ≽ to denote the
preference relation over alternatives corresponding to ρ, and we use ⊵ to denote
the similarity relation over pairs corresponding to ρ. Choice alternatives are
labeled 1, 2 and 3.

labeled 1, 2 and 3.

In the Theorems below, alternatives 1 and 2 are equally desirable: 1 ∼ 2. So
alternatives 1 and 2 are chosen with probability one-half from menu {1, 2} at any
date. We also assume that (1, 3) ▷ (1, 2) and (1, 3) ▷ (2, 3), i.e., the pair (1, 3) is
strictly more similar than the pairs (1, 2) and (2, 3). In other words, alternative 3
is more similar to alternative 1 than to alternative 2.

Theorem 3 (Similarity Puzzle). There are times T1, T2 > 0 such that introducing

3 hurts 2 more than it hurts 1 for all times before T1, and the reverse occurs for

all times greater than T2.

Early in the learning process, the decision maker is relatively unfamiliar with

the options. The relative ranking of similar options will be learned faster than

dissimilar ones. Theorem 3 shows that this can go against Tversky’s similarity

hypothesis early in the learning process, but that as the decision maker becomes

more familiar with the options this “attraction effect” disappears. The following

Corollary illustrates both effects in their most extreme form.

Corollary 4. Let 1 ∼ 2 ∼ 3 and ε > 0. If the pairs (1, 2) and (2, 3) are equally

similar and the pair (1, 3) is taken sufficiently similar, we have

(i) for all t sufficiently small, ρt(1, {1, 2, 3}), ρt(3, {1, 2, 3}) > 1/2 − ε; and

(ii) for all t sufficiently large, ρt(1, {1, 2, 3}), ρt(3, {1, 2, 3}) ∈ [1/4, 1/4 + ε).

Item (i) in the Corollary says that, when the decision maker is uninformed,

she will choose one of the alternatives of the very similar pair with probability

close to one. To understand this result, consider what happens at times close to


zero, when utility signals are very imprecise. As Proposition 7 below shows, in the

limit as t goes to zero utility becomes irrelevant, and choice probabilities depend

solely on similarity. When the similarity of the pair (1, 3) is extreme, the decision

maker rapidly learns whether 1 is a better alternative than 3 or vice versa, before

learning much about anything else, and therefore alternative 2 will seldom be the

most attractive.

Item (ii) says that, in the limit as t goes to infinity, the alternatives in the very

similar pair (1, 3) are treated as a single option. In general, any small difference

in the utility of the alternatives will eventually be learned by the decision maker,

and the best alternative will be chosen with probability close to one. When the

three alternatives have the exact same utility, it is similarity, rather than utility,

that determines how ties are broken: alternative 2 is chosen with probability

close to one-half and an alternative in the very similar pair (1, 3) is chosen

with probability close to one-half.

Our final result shows that the initial positive effect of similarity can persist

indefinitely when the similar options are not equally desirable. Moreover, sim-

ilarity can cause the probability of an existing option to increase, leading to a

violation of monotonicity and to the classic attraction effect. Recall that alter-

natives 1 and 2 are equally desirable, and therefore chosen from the menu {1, 2}
with probability one-half at any date.

Theorem 5 (Attraction effect). For every ε > 0, if alternative 3 is sufficiently

inferior, then alternative 1 is chosen from menu {1, 2, 3} with probability strictly

larger than one-half for all t > ε.

Hence adding a sufficiently inferior alternative 3 to the menu {1, 2} will boost
the probability of the similar alternative 1 being chosen, the phenomenon known

as the attraction effect. The Theorem shows that the attraction effect appears

arbitrarily early in the choice process, and lasts indefinitely. For a concrete

illustration of how the attraction effect arises in the context of our model, consider

the following numerical example.

Example 4 (Attraction Effect). The set of alternatives is given by A = {1, 2, 3}.
The decision maker's prior beliefs are i.i.d. standard normal. Alternatives 1
and 2 are equally desirable with utility µ1 = µ2 = 3. Alternative 3

is less desirable with µ3 = −3. The inferior alternative 3 is similar to alternative


1 with γ13 = 1/2, while γ12 = γ23 = 0.

When choosing from the menu {1, 2} the decision maker can never discriminate
between the two options: each option is chosen with probability one-half at

any given time. Figure 5 describes what happens once the inferior but similar

alternative is introduced.

Figure 5: The attraction effect (choice probability plotted against time, log scale)

Figure 5 shows time in logarithmic scale, to better visualize the start of the

choice process. Note that while the graph shows 10000 units of time, the first

half covers the first unit of time. The top curve (solid line) is the probability that

alternative 1 is chosen; the middle curve (dotted line) corresponds to the choice

probability for alternative 2; and the bottom curve (dashed line) corresponds to

the choice probability of alternative 3.

At the beginning of the choice process, when choices are based on relatively

little information about the alternatives, the probability of choosing the inferior

alternative 3 rapidly decreases, while the probability of choosing alternative 1 is

boosted up. Similarity allows the decision maker to learn the relative ranking

of alternatives 1 and 3 faster than the relative ranking of the other pairs. This

leads to the attraction effect: the addition of the inferior but similar alternative

3 causes the probability of choosing alternative 1 to increase.


The attraction effect is sustained as information gets more precise. Since alternatives 1 and 2 are equally desirable, eventually each one is chosen with probability converging to one-half. But while the choice probability of alternative 1 tends to one-half from above, the choice probability of alternative 2 tends to one-half from below. The choice probability of alternative 1 jumps above one-half very early, violating monotonicity, and remains there for the entire length of the choice process. By Theorem 5, we know this violation of monotonicity can happen arbitrarily early in the choice process if the added alternative 3 is sufficiently inferior. ♦

5 Naïve probit versus Bayesian probit

Proposition 6 (Equivalence of binary choices). A binary random choice process ρ is a Bayesian probit process if and only if it is a naïve probit process. In the affirmative case, they can share the same utility signal structure.

Binary choices alone do not allow us to distinguish a Bayesian probit process from a naïve probit process, in either the baseline characterization or the main characterization above. We now show that while behavior is indistinguishable when restricted to binary choices, the overlap between the two models disappears when choices from menus with three or more alternatives are considered. To see this, it is sufficient to consider ternary choice. The next result shows that in menus of three alternatives, the Bayesian probit process and the naïve probit process exhibit radically different choice behavior from the very beginning. Let ρ be a Bayesian probit process and let ρ̄ be a naïve probit process sharing the utility signal parameters (µ, γ). Let b = {1, 2, 3}.

Proposition 7 (Ternary choice at time zero). In the limit as t → 0+, the naïve probit rule and the Bayesian probit rule choose alternative 1 with probability

ρ̄0(1, b) = 1/4 + (1/2π) arctan( (1 + γ23 − γ12 − γ13) / √( 2(1 − γ12) · 2(1 − γ13) − (1 + γ23 − γ12 − γ13)² ) )

ρ0(1, b) = 1/4 + (1/2π) arctan( ( (1 + γ12)(1 + γ13) − γ23(1 + γ12 + γ13 + γ23) ) / √( (3 + 2γ12 + 2γ13 + 2γ23)(1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ) )

respectively, and analogous expressions hold for alternatives 2 and 3.
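As a numerical illustration, both closed forms can be evaluated directly. The sketch below (function names are ours) confirms that with orthogonal noise both rules start from the uniform distribution:

```python
from math import atan, pi, sqrt

def naive_p0(g12, g13, g23):
    """Time-zero probability of choosing alternative 1 under the naive probit."""
    num = 1 + g23 - g12 - g13
    den = sqrt(2 * (1 - g12) * 2 * (1 - g13) - num ** 2)
    return 0.25 + atan(num / den) / (2 * pi)

def bayes_p0(g12, g13, g23):
    """Time-zero probability of choosing alternative 1 under the Bayesian probit."""
    num = (1 + g12) * (1 + g13) - g23 * (1 + g12 + g13 + g23)
    den = sqrt((3 + 2 * g12 + 2 * g13 + 2 * g23)
               * (1 + 2 * g12 * g13 * g23 - g12**2 - g13**2 - g23**2))
    return 0.25 + atan(num / den) / (2 * pi)

# With orthogonal noise both rules start uniform:
print(naive_p0(0, 0, 0), bayes_p0(0, 0, 0))  # both equal 1/3
```

Varying γ12 with γ13 = γ23 = 0 reproduces the two curves shown in Figure 6 below.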

Corollary 8. The Bayesian probit process and the naïve probit process are indistinguishable if and only if they share utility signal parameters and γ is orthogonal.


Proposition 7 and its corollary show that when we observe choice behavior from menus with more than two alternatives, the Bayesian probit process and the naïve probit process exhibit incompatible behavior, except for the knife-edge case of a common parameterization with orthogonal noise. Figure 6 illustrates how choices for both processes start at time zero from the menu b = {1, 2, 3}. The Figure shows how the probability of choosing alternative 1 changes as a function of the correlation γ12 between alternatives 1 and 2. The remaining correlation parameters γ13 and γ23 are fixed at zero. The solid line corresponds to the Bayesian probit rule, and the dashed line corresponds to the naïve probit rule. Note that the curves intersect when γ12 = 0, where noise is orthogonal and both processes start with a uniform distribution, i.e., each alternative is chosen with probability 1/3. As can be seen in Proposition 7, choices at time zero do not depend on µ.

Proposition 7 and its corollary relied on choice behavior at the start of the learning process to distinguish between the Bayesian probit and the naïve probit. We can in general distinguish them from observed choice behavior, except in the knife-edge case of orthogonal utility signals. Note that since each random choice rule ρ̄t in the naïve probit is a random utility maximizer, the naïve probit process must satisfy monotonicity. Our main results in the previous section demonstrated another distinction of the Bayesian probit from the naïve probit in terms of observable choices, namely, the violation of monotonicity.

6 Application: Relating Noise to Observables

In experimental settings where the attraction and compromise effects are observed, choice alternatives are often described by two attributes. For example, in Simonson (1989) cars are described by 'ride quality' and 'miles per gallon', brands of beer are described by 'price of a six-pack' and 'quality rating', and brands of mouthwash by 'fresh breath effectiveness' and 'germ-killing effectiveness'. Assume that each alternative xi is described by a vector (xi1, xi2) ∈ R². The extra structure allows us to build a model in which the noise correlation depends explicitly on the observable characteristics. This model allows us to distinguish violations of monotonicity based on the location of the new alternatives in the attribute space. For example, it allows us to distinguish the attraction effect from the compromise effect.



Figure 6: Probability of alternative 1 being chosen from menu b = {1, 2, 3} as a function of the correlation γ12 between the utility signals of alternatives 1 and 2, in the limit as t is taken arbitrarily close to zero. The solid line corresponds to the Bayesian learning model. The dashed line corresponds to the naïve learning probit. The correlations γ13 and γ23 are fixed equal to zero. Since choices at time zero are determined solely by γ, and the parameterization is symmetric from the point of view of alternatives 1 and 2, the graph simultaneously describes the choice probabilities for alternative 2. When γ12 is zero the utility signal structure is simple, choices at time zero correspond to the uniform distribution for both models, and therefore the two lines cross at exactly one third. As γ12 reaches one, alternatives 1 and 2 become maximally similar and the naïve learner chooses each of them with probability 1/4. In contrast, the Bayesian learner chooses alternatives 1 and 2 with probability 1/2 and alternative 3 with probability zero.


Example 5 (Parameterization of covariance matrix in multinomial probit). Hausman and Wise (1978) propose the following parameterization of the covariance matrix for a multinomial probit, when each choice alternative xi is described by two positive coordinates (xi1, xi2) ∈ R². In their specification, the random utility for alternative xi is given by

U(xi) = β1 xi1 + β2 xi2

where β1, β2 are independent and normally distributed with the same variance σ² > 0. This can be interpreted as a consumer with a linear utility over objects in a two-dimensional commodity space, where the linear weights of each dimension are random. The common weights cause the utilities of different alternatives to be correlated. Specifically, we have

Corr(U(xi), U(xj)) = (xi1 xj1 σ² + xi2 xj2 σ²) / ( √(σ²(xi1² + xi2²)) √(σ²(xj1² + xj2²)) ) = (xi1 xj1 + xi2 xj2) / (‖xi‖ ‖xj‖) = cos(xi, xj)

so that the correlation coefficient for the utilities of alternatives xi and xj corresponds to the cosine of the angle formed by the two alternatives in the two-dimensional commodity space. ♦
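Under this parameterization, the similarity parameter is just the cosine of the angle between the attribute vectors, which is immediate to compute (a small sketch; the helper name is ours):

```python
from math import sqrt

def gamma(xi, xj):
    """Noise correlation as the cosine of the angle between attribute vectors."""
    dot = xi[0] * xj[0] + xi[1] * xj[1]
    return dot / (sqrt(xi[0]**2 + xi[1]**2) * sqrt(xj[0]**2 + xj[1]**2))

print(gamma((1, 2), (2, 4)))  # same ray from the origin: maximally similar, 1.0
print(gamma((1, 0), (0, 1)))  # orthogonal attributes: correlation 0.0
```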

Our behavioral notion of similarity translates immediately to this particular setting under the simple parameterization of Example 5. Since similarity is captured by noise correlation, two alternatives are maximally similar when they lie on the same ray departing from the origin. In this case they form an angle of zero radians, and the correlation γij is equal to the cosine of zero, which is one. This corresponds to the intuition that when one alternative clearly dominates the other in all attributes, comparison is facilitated. On the other hand, if alternatives form an angle close to ninety degrees, the cosine of their angle is close to zero, and noise is uncorrelated. Intuitively, since the alternatives do not share any physical attributes, their comparison is hard.



Figure 7: The noise correlation parameter γij equals the cosine of the angle formed by i and j in R².

7 Conclusion

We proposed a model of random choice in which the decision maker initially has a coarse and symmetric perception of the utility of the alternatives and, before making a choice, engages in a learning process about her own ranking of the available options. The learning process is interrupted when a choice must be made. We characterized the random choice behavior of such a decision maker when information about utility follows a diffusion process. We provided conditions that allowed us to identify a preference ranking over the alternatives and, in addition, a similarity ranking over pairs of alternatives. Both rankings are subjective and revealed by the choice process.

In the model, the intensity with which a random choice rule discriminates between two alternatives in a menu depends on three factors: the relative ranking of the alternatives according to preference; the similarity of the alternatives; and the overall precision of the information available to the decision maker. We modeled choices as a random choice process in which the precision of the information can be interpreted as the amount of time available to contemplate the options. Discrimination improves when alternatives are more distant according to preference, when the alternatives are more similar, and with the amount of time the decision maker is allowed to contemplate the alternatives.

The model captures similarity in the form of utility signal correlation. Signal

correlation causes the learning process to be context dependent, and therefore

choices are also context dependent. By identifying a notion of similarity that is

independent of preference, this model allowed us to solve the similarity puzzle of

random choice.

By allowing context dependence through noise correlation, the model has the potential to address other empirical regularities found in the literature and related to context dependence, such as the status quo bias and the compromise effect. Applications of the model to these phenomena are left for future work.

References

Campbell, Donald E., "Realization of Choice Functions," Econometrica, 1978, 46 (1), 171–180.

Caplin, Andrew and Mark Dean, "Search, Choice, and Revealed Preference," Theoretical Economics, Forthcoming 2010.

Debreu, Gerard, "Review: Individual Choice Behavior," The American Economic Review, 1960, 50 (1), 186–188.

Gul, Faruk, Paulo Natenzon, and Wolfgang Pesendorfer, "Random Choice as Behavioral Optimization," Working Paper, Princeton University, February 2012.

Hausman, Jerry A. and David A. Wise, "A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences," Econometrica, 1978, 46 (2), 403–426.

Huber, Joel and Christopher Puto, "Market Boundaries and Product Choice: Illustrating Attraction and Substitution Effects," The Journal of Consumer Research, 1983, 10 (1), 31–44.

Huber, Joel, J. W. Payne, and C. Puto, "Adding Asymmetrically Dominated Alternatives: Violations of Regularity and the Similarity Hypothesis," Journal of Consumer Research, 1982, 9 (1), 90–98.

Keller, Godfrey, "Brownian Motion and Normally Distributed Beliefs," Working Paper, University of Oxford, December 2009.

Luce, R. Duncan, Individual Choice Behavior: A Theoretical Analysis, New York: Wiley, 1959.

Manzini, Paola and Marco Mariotti, "Sequentially Rationalizable Choice," American Economic Review, 2007, 97 (5), 1824–1839.

Masatlioglu, Y., D. Nakajima, and E. Y. Ozbay, "Revealed Attention," The American Economic Review, 2012, 102 (5), 2183–2205.

McFadden, Daniel, "Conditional Logit Analysis of Qualitative Choice Behavior," in Paul Zarembka, ed., Frontiers in Econometrics, New York: Academic Press, 1974, chapter 4, 105–142.

McFadden, Daniel, "Economic Choices," The American Economic Review, 2001, 91 (3), 351–378.

Ok, E. A., P. Ortoleva, and G. Riella, "Revealed (P)reference Theory," Working Paper, August 2012.

Simonson, Itamar, "Choice Based on Reasons: The Case of Attraction and Compromise Effects," The Journal of Consumer Research, 1989, 16 (2), 158–174.

Train, Kenneth, Discrete Choice Methods with Simulation, 2nd ed., Cambridge University Press, 2009.

Tversky, A., "Choice by Elimination," Journal of Mathematical Psychology, 1972, 9 (4), 341–367.

Tversky, A., "Elimination by Aspects: A Theory of Choice," Psychological Review, 1972, 79 (4), 281–299.

8 Appendix: Proofs

Proof of Theorem 1 — Necessity

Suppose ρ is the Bayesian learning process with prior parameters (m0, s0²) and utility signal structure (µ, γ). Consider an enumeration of a finite menu b = {1, 2, . . . , n}. Prior beliefs about µ are jointly Gaussian with mean m(0) = (m0, m0, . . . , m0) and covariance matrix s(0) = s0² I, where I is the n×n identity matrix. An application of Bayes' rule gives that posterior beliefs about µ after observing the signal process X up to time t > 0 are jointly Gaussian with mean m(t) and covariance s(t) given by:

s(t) = [ s(0)⁻¹ + t(ΛΛ′)⁻¹ ]⁻¹

m(t) = s(t)[ s(0)⁻¹ m(0) + (ΛΛ′)⁻¹ X(t) ]

where it is immediate that s(t) is a deterministic function of time, while m(t) is random and has a joint Gaussian distribution itself. The parameters of the posterior mean m(t) are given by

E[m(t)] = s(t)[ s(0)⁻¹ m(0) + t(ΛΛ′)⁻¹ µ ]

Var[m(t)] = t s(t)(ΛΛ′)⁻¹ s(t)′
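These updating formulas translate directly into code. The following sketch (variable and function names are ours) also illustrates the convergence of beliefs stated in Lemma 9 below:

```python
import numpy as np

def posterior(t, m0, s0sq, mu, G):
    """Mean and covariance of the posterior mean m(t), for prior N(m0, s0sq*I)
    and signal covariance G = ΛΛ'."""
    n = len(mu)
    Ginv = np.linalg.inv(G)
    st = np.linalg.inv(np.eye(n) / s0sq + t * Ginv)   # s(t)
    Em = st @ (m0 / s0sq + t * Ginv @ mu)             # E[m(t)]
    Vm = t * st @ Ginv @ st.T                         # Var[m(t)]
    return Em, Vm

mu = np.array([3.0, 3.0, -3.0])
G = np.eye(3)                                          # simple (orthogonal) signals
Em, Vm = posterior(1e6, np.zeros(3), 1.0, mu, G)
print(Em)  # for large t, beliefs concentrate on the true utilities
```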

The following Lemma is standard and its proof is omitted. For a discussion of the single-dimensional case see Keller (2009).

Lemma 9. Beliefs converge to the true value: E[m(t)] → µ, Var[m(t)] → 0 and s(t) → 0 as t → ∞.

Now consider a binary menu enumerated as b = {1, 2}. When facing the alternatives in menu b, the decision maker's choices at t > 0 depend on the realization of the random vector m(t) = (m1(t), m2(t)). Alternative 1 is chosen if and only if m2(t) − m1(t) < 0. Since m(t) is Gaussian, m2(t) − m1(t) is also Gaussian with mean and variance given by

( ts0²(µ2 − µ1)/(ts0² + 1 − γ12),  2ts0⁴(1 − γ12)/(ts0² + 1 − γ12)² )

hence

ρt(1, 2) = P{m2(t) − m1(t) < 0} = Φ( √t (µ1 − µ2)/√(2(1 − γ12)) )

where again Φ denotes the cdf of the standard Gaussian distribution. When the utility signal structure (µ, γ) is simple this reduces to

ρt(i, j) = Φ( √t (µi − µj)/√2 )


From this expression it is straightforward to verify that ρ is continuous, monotonic, starts uninformed, ultimately learns, and has a Gaussian learning rate. Finally, it is also easy to verify that it satisfies the Gaussian triangle condition, where ρ̄t(i, j) = Φ⁻¹(ρt(i, j)):

ρ̄t(i, j) + ρ̄t(j, k) = √t (µi − µj + µj − µk)/√2 = ρ̄t(i, k).

Proof of Theorem 1 — Sufficiency

Suppose ρ′ is a continuous, monotonic random choice process that starts uninformed, ultimately learns, has a Gaussian learning rate and satisfies the Gaussian triangle condition. Suppose ρ′t∗(i∗, j∗) > 1/2 for some i∗, j∗, t∗.

Step 1: Time change

Let f : T → T be given by

f(s) = ( ρ̄′s(i∗, j∗) / ρ̄′t∗(i∗, j∗) )²,  for all s ∈ T

and note that it is well-defined since ρ̄′t∗(i∗, j∗) ≠ 0. Since ρ′ is continuous and monotonic, f is continuous and strictly increasing. So f is a time change. In particular we have f(0) = 0 and f(t∗) = 1. We use this time change to define the random choice process ρ by ρt(i, a) = ρ′f⁻¹(t)(i, a) for each alternative i, menu a and time t.

Step 2: Time change sets learning rate

For any pair of alternatives k, l ∈ A we have

ρ̄t(k, l) = ρ̄′f⁻¹(t)(k, l)
         = ( ρ̄′f⁻¹(t)(i∗, j∗) / ρ̄′t∗(i∗, j∗) ) × ρ̄′t∗(k, l)
         = √t × ρ̄′f⁻¹(1)(k, l)
         = √t × ρ̄1(k, l)

hence

√s × ρ̄t(k, l) = √(st) ρ̄1(k, l) = √t × ρ̄s(k, l)


Figure 8: Time change.

Step 3: Gaussian triangle condition

Note that ρ satisfies the Gaussian triangle condition, just like ρ′. The Gaussian triangle condition is a restriction on each random choice rule ρt, and is therefore not affected by the time change.

Step 4: Construct utility signal structure

Take an alternative i∗ ∈ A and define µ : A → R by

µ(j) := √2 ρ̄1(j, i∗), for all j ∈ A.

Since we use the convention ρt(i, i) = 1/2 for all i ∈ A, this implies µ(i∗) = 0. Define γ : A × A → [−1, 1] so that (µ, γ) is simple: let γ(i, i) = 1 for all i ∈ A and γ(i, j) = 0 for all i ≠ j.

Step 5: (µ, γ) is a representation for ρ

By the Gaussian learning rate and the time change, for each t > 0 and j ∈ A we have

√t µ(j) = √2 √t ρ̄1(j, i∗) = √2 ρ̄t(j, i∗)


By the Gaussian triangle condition, for every i, j ∈ A and t > 0

ρt(i, j) = Φ( ρ̄t(i, j) )
         = Φ( ρ̄t(i, i∗) + ρ̄t(i∗, j) )
         = Φ( ρ̄t(i, i∗) − ρ̄t(j, i∗) )
         = Φ( √t (µ(i) − µ(j))/√2 )

which together with the necessity part shows that ρ is a Bayesian learning process with a simple utility signal structure.

Proof of Proposition 2

Since i ≻ j we have µi > µj, so E[mi(t) − mj(t)] → µi − µj > 0. Since Var[m(t)] → 0 as t goes to infinity, the probability that mi(t) > mj(t) goes to one.

Proof of Proposition 7

In general, for random utility models that specify Gaussian error terms, calculating choice probabilities from the parameters of the model involves the numerical computation of multivariate integrals. Here we show that for the starting time of a random choice process, we can obtain expressions for ternary choices in terms of elementary (trigonometric) functions. We interpret choices at time zero as the limit of the distribution of choices at time t as t is taken arbitrarily small. We will obtain expressions showing that a Bayesian learning model and a naïve learning probit model sharing the same utility signal structure (µ, γ) will in general exhibit different choice behavior for times close to zero. The two coincide only in the knife-edge case in which the common utility signal structure is simple. In this knife-edge case, both choice processes start with a uniform distribution over the available alternatives.

Let (µ, γ) be a utility signal structure and let b = {1, 2, 3}. First consider the naïve learning probit model. The utility signal X corresponding to b has three dimensions. For every time t > 0 we have X(t) ∼ N(tµ, tΛΛ′). Let L1 be the 2×3 matrix given by

L1 = [ −1 1 0
       −1 0 1 ]


so that L1X(t) = (X2(t) − X1(t), X3(t) − X1(t)). Then

ρ̄t(1, b) = P{X1(t) > X2(t) and X1(t) > X3(t)} = P{L1X(t) ≤ 0}.

The vector L1X(t) is jointly Gaussian with mean tL1µ and covariance tL1ΛΛ′L′1. So L1X(t) has the same distribution as tL1µ + √t MB, where B is a bi-dimensional standard Gaussian random vector and MM′ = L1ΛΛ′L′1 has full rank. Take M

as the Cholesky factorization

M = [ √(2(1 − γ12))                            0
      (1 + γ23 − γ12 − γ13)/√(2(1 − γ12))      √( 2(1 − γ13) − (1 + γ23 − γ12 − γ13)²/(2(1 − γ12)) ) ]
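As a sanity check, one can verify numerically that this matrix is indeed the Cholesky factor of L1ΛΛ′L′1 (a sketch with arbitrarily chosen γ values; variable names are ours):

```python
import numpy as np

g12, g13, g23 = 0.2, 0.5, -0.1
G = np.array([[1, g12, g13], [g12, 1, g23], [g13, g23, 1]])  # signal covariance
L1 = np.array([[-1.0, 1.0, 0.0], [-1.0, 0.0, 1.0]])

Sigma = L1 @ G @ L1.T          # covariance of (X2 - X1, X3 - X1) per unit time
M = np.linalg.cholesky(Sigma)  # lower-triangular factor with M M' = Sigma

c = 1 + g23 - g12 - g13        # off-diagonal term appearing in the text
assert np.isclose(M[0, 0], np.sqrt(2 * (1 - g12)))
assert np.isclose(M[1, 0], c / np.sqrt(2 * (1 - g12)))
assert np.isclose(M[1, 1], np.sqrt(2 * (1 - g13) - c**2 / (2 * (1 - g12))))
print("Cholesky factor matches the closed form")
```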

then we can write for each t > 0,

ρ̄t(1, b) = P{L1X(t) ≤ 0}
          = P{tL1µ + √t MB ≤ 0}
          = P{MB ≤ −√t L1µ}

and taking t → 0 we obtain

ρ̄0(1, b) = P{MB ≤ 0}

Now MB ≤ 0 if and only if

0 ≥ √(2(1 − γ12)) B1
and
0 ≥ (1 + γ23 − γ12 − γ13)/√(2(1 − γ12)) B1 + √( 2(1 − γ13) − (1 + γ23 − γ12 − γ13)²/(2(1 − γ12)) ) B2

if and only if

B1 ≤ 0
and
B2 ≤ −B1 (1 + γ23 − γ12 − γ13) / √( 2(1 − γ12) · 2(1 − γ13) − (1 + γ23 − γ12 − γ13)² )

which describes a cone in R² and, due to the circular symmetry of the standard Gaussian distribution, we have

ρ̄0(1, b) = 1/4 + (1/2π) arctan( (1 + γ23 − γ12 − γ13) / √( 2(1 − γ12) · 2(1 − γ13) − (1 + γ23 − γ12 − γ13)² ) ).


with analogous expressions for ρ̄0(2, b) and ρ̄0(3, b).
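The closed form can be checked against a direct Monte Carlo estimate of P{MB ≤ 0} (a sketch; sample size, seed, and function names are ours):

```python
import numpy as np

def naive_p0(g12, g13, g23):
    """Closed-form time-zero choice probability of alternative 1 (naive probit)."""
    c = 1 + g23 - g12 - g13
    den = np.sqrt(2 * (1 - g12) * 2 * (1 - g13) - c**2)
    return 0.25 + np.arctan(c / den) / (2 * np.pi)

def naive_p0_mc(g12, g13, g23, n=400_000, seed=0):
    """Monte Carlo estimate of P{X1 > X2 and X1 > X3} in the t -> 0 limit."""
    G = np.array([[1, g12, g13], [g12, 1, g23], [g13, g23, 1]])
    L1 = np.array([[-1.0, 1.0, 0.0], [-1.0, 0.0, 1.0]])
    rng = np.random.default_rng(seed)
    # Sample (X2 - X1, X3 - X1) with covariance L1 G L1' and check both <= 0.
    Z = rng.multivariate_normal(np.zeros(2), L1 @ G @ L1.T, size=n)
    return np.mean((Z[:, 0] <= 0) & (Z[:, 1] <= 0))

print(naive_p0(0.3, 0.0, 0.0), naive_p0_mc(0.3, 0.0, 0.0))  # should agree closely
```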

Now consider the Bayesian learning model with the same utility signal structure and prior parameters (m0, s0²). Let L1 be defined as before so that L1m(t) = (m2(t) − m1(t), m3(t) − m1(t)). Then the Bayesian learner chooses alternative one at time t > 0 with probability

ρt(1, b) = P{m1(t) > m2(t) and m1(t) > m3(t)} = P{L1m(t) ≤ 0}.

The vector L1m(t) is jointly Gaussian with mean L1s(t)[ s(0)⁻¹m(0) + t(ΛΛ′)⁻¹µ ] and covariance tL1s(t)(ΛΛ′)⁻¹s(t)′L′1.

The two-dimensional random vector L1m(t) has the same distribution as

L1s(t)s(0)⁻¹m(0) + tL1s(t)(ΛΛ′)⁻¹µ + √t M(t)B

where B is a bi-dimensional standard Gaussian and M(t)M(t)′ = L1s(t)(ΛΛ′)⁻¹s(t)′L′1. Notice the contrast with the naïve learner model, where M did not depend on t. We can take M(t) to be the Cholesky factorization

M(t) = [ M11(t)   0
         M21(t)   M22(t) ]

given by

M11(t) = √( C1(t) / ( −1 − 3s⁴t² − s⁶t³ + γ12² + γ13² − 2γ12γ13γ23 + γ23² + s²t(−3 + γ12² + γ13² + γ23²) )² )

where

C1(t) = s⁴[ −2s⁸t⁴(−1 + γ12) − 2s⁶t³( −4 + 2γ12(1 + γ12) + (γ13 − γ23)² )
  − 4s²t(2 + γ12)( −1 + γ12² + γ13² − 2γ12γ13γ23 + γ23² )
  − s⁴t²( −12 + 7γ13² + 2γ12(γ12(5 + γ12) + γ13²) − 6(1 + 2γ12)γ13γ23 + (7 + 2γ12)γ23² )
  − ( −1 + γ12² + γ13² − 2γ12γ13γ23 + γ23² )( 2 + 2γ12 − (γ13 + γ23)² ) ]

and the expressions for M21(t) and M22(t) are similarly cumbersome and omitted.

Now we can write, for each t > 0,

ρt(1, b) = P{L1m(t) ≤ 0}
         = P{L1s(t)s(0)⁻¹m(0) + tL1s(t)(ΛΛ′)⁻¹µ + √t M(t)B ≤ 0}
         = P{√t M(t)B ≤ −L1s(t)s(0)⁻¹m(0) − tL1s(t)(ΛΛ′)⁻¹µ}
         = P{M(t)B ≤ −(1/√t) L1s(t)s(0)⁻¹m(0) − √t L1s(t)(ΛΛ′)⁻¹µ}


Lemma 10. In the limit as t goes to zero,

lim_{t→0+} (1/√t) L1s(t)s(0)⁻¹m(0) = 0,

lim_{t→0+} M11(t) = s² √( (2 + 2γ12 − γ13² − 2γ13γ23 − γ23²) / (1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ) > 0,

and

lim_{t→0+} M21(t)/M22(t) = ( (1 + γ12)(1 + γ13) − γ23(1 + γ12 + γ13 + γ23) ) / √( (3 + 2γ12 + 2γ13 + 2γ23)(1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ).

Proof. Long and cumbersome, omitted.

Using Lemma 10 we have

ρ0(1, b) = lim_{t→0+} ρt(1, b)
 = lim_{t→0+} P{M(t)B ≤ −(1/√t) L1s(t)s(0)⁻¹m(0) − √t L1s(t)(ΛΛ′)⁻¹µ}
 = P{ B1 lim_{t→0+} M11(t) ≤ 0 and B2 ≤ −B1 lim_{t→0+} M21(t)/M22(t) }
 = P{ B1 ≤ 0 and B2 ≤ −B1 ( (1 + γ12)(1 + γ13) − γ23(1 + γ12 + γ13 + γ23) ) / √( (3 + 2γ12 + 2γ13 + 2γ23)(1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ) }

and by the circular symmetry of the standard Gaussian distribution we obtain

ρ0(1, b) = 1/4 + (1/2π) arctan( ( (1 + γ12)(1 + γ13) − γ23(1 + γ12 + γ13 + γ23) ) / √( (3 + 2γ12 + 2γ13 + 2γ23)(1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ) )

with analogous expressions for ρ0(2, b) and ρ0(3, b).

Proof of Corollary 8

Immediate from Proposition 7.

Proof of Theorem 3

When the menu of available alternatives is {1, 2}, both 1 and 2 are equally likely to be chosen at the start of the random choice process, i.e., in the limit as t → 0+. By Proposition 7, when the menu includes alternative 3, the probability that alternative 1 is chosen at the start of the random choice process is given by

1/4 + (1/2π) arctan( ( (1 + γ12)(1 + γ13) − γ23(1 + γ12 + γ13 + γ23) ) / √( (3 + 2γ12 + 2γ13 + 2γ23)(1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ) )


and the probability for alternative 2 is given by

1/4 + (1/2π) arctan( ( (1 + γ12)(1 + γ23) − γ13(1 + γ12 + γ13 + γ23) ) / √( (3 + 2γ12 + 2γ13 + 2γ23)(1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ) )

Since the function arctan is strictly increasing, the probability of alternative 1 is larger than the probability of alternative 2 if and only if

(1 + γ12)(1 + γ13) − γ23(1 + γ12 + γ13 + γ23) > (1 + γ12)(1 + γ23) − γ13(1 + γ12 + γ13 + γ23)

which holds if and only if

(γ13 − γ23)(2 + 2γ12 + γ13 + γ23) > 0

which holds since all γij are positive and, since the pair (1, 3) is more similar than the pair (2, 3), we have γ13 > γ23. By continuity, this shows that for t close to zero introducing alternative 3 hurts alternative 2 more than it hurts alternative 1.

The second part of the Theorem follows from the fact that, as t goes to infinity, ρt converges to a standard probit random choice rule with E[m(t)] → µ and tVar[m(t)] → ΛΛ′.

Proof of Corollary 4

Fix utility signal parameters γ12, γ13, γ23 ∈ [−1, 1] such that γ13 > γ12 and γ13 > γ23. Since the γij form a positive definite matrix for every finite menu, we have the following positive determinant

1 + 2γ12γ13γ23 − γ12² − γ13² − γ23² > 0

therefore

γ12γ23 − √((1 − γ12²)(1 − γ23²)) < γ13 < γ12γ23 + √((1 − γ12²)(1 − γ23²))

which provides bounds for γ13 for fixed values of γ12 and γ23. The next Lemma shows that, when γ13 is the largest of the three parameters, the upper bound is strictly positive.

Lemma 11. Let γ be utility signal parameters with γ13 > γ12 and γ13 > γ23. Then γ12γ23 + √((1 − γ12²)(1 − γ23²)) > 0.

40

Page 41: Random Choice and Learning - FGV-EESPeesp.fgv.br/sites/eesp.fgv.br/files/file/Paulo_Natenzon.pdf · Random Choice and Learning Paulo Natenzon Washington University in Saint Louis

Proof. If γ12 and γ23 are negative then γ12γ23 is positive and we are done. Suppose γ12 ≥ 0; then γ12γ23 + √((1 − γ12²)(1 − γ23²)) > γ13 > γ12 ≥ 0 and we are done. The same applies if γ23 ≥ 0.

By Proposition 7, to prove (i) it suffices to show that, as γ13 approaches the limit γ12γ23 + √((1 − γ12²)(1 − γ23²)) from the left, we have

( (1 + γ12)(1 + γ13) − γ23(1 + γ12 + γ13 + γ23) ) / √( (3 + 2γ12 + 2γ13 + 2γ23)(1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²) ) → +∞   (6)

so it is sufficient to show that the numerator above converges to a strictly positive number, while the denominator is positive and converges to zero.

To verify that the numerator in (6) converges to a strictly positive number, first suppose γ23 ≤ 0. Then as γ13 → γ12γ23 + √((1 − γ12²)(1 − γ23²)), the numerator tends to

(1 + γ12)( 1 + γ12γ23 + √((1 − γ12²)(1 − γ23²)) ) − γ23( 1 + γ12 + γ12γ23 + √((1 − γ12²)(1 − γ23²)) + γ23 )

= (1 + γ12)( 1 − γ23² + γ12γ23 + √((1 − γ12²)(1 − γ23²)) ) − γ23(1 + γ12) − γ23√((1 − γ12²)(1 − γ23²))

which is strictly positive since 1 + γ12 > 0, 1 − γ23² > 0, γ12γ23 + √((1 − γ12²)(1 − γ23²)) > 0 by the previous Lemma, and, since we are assuming γ23 ≤ 0, the terms −γ23(1 + γ12) and −γ23√((1 − γ12²)(1 − γ23²)) are also nonnegative.

Now suppose instead that γ23 > 0. Since the pair (1, 3) is more similar than (2, 3) we have γ12γ23 + √((1 − γ12²)(1 − γ23²)) > γ23, which holds if and only if γ23 < √(1 + γ12)/√2. The numerator in (6) converges to

(1 + γ12)( 1 − γ23² + γ12γ23 + √((1 − γ12²)(1 − γ23²)) ) − γ23(1 + γ12) − γ23√((1 − γ12²)(1 − γ23²))

> (1 + γ12)( 1 − γ23² + γ23 − γ23 ) − γ23√((1 − γ12²)(1 − γ23²))

> (1 + γ12)(1 − γ23²) − ( √(1 + γ12)/√2 ) √((1 − γ12²)(1 − γ23²))

= (1 + γ12)(1 − γ23²) − (1/√2)(1 + γ12)√(1 − γ12)√(1 − γ23²)

= (1 + γ12)√(1 − γ23²) ( √(1 − γ23²) − √(1 − γ12)/√2 )

where (1 + γ12) > 0, √(1 − γ23²) > 0 and the expression in parentheses is strictly positive, since 0 < γ23 < √(1 + γ12)/√2 implies γ23² < (1 + γ12)/2, which implies

√(1 − γ23²) > √( 1 − (1 + γ12)/2 ) = √(1 − γ12)/√2.

Hence the numerator in (6) converges to a strictly positive number.

Now consider the denominator in (6). The determinant 1 + 2γ12γ13γ23 − γ12² − γ13² − γ23² is positive and tends to zero as γ13 goes to the upper bound γ12γ23 + √((1 − γ12²)(1 − γ23²)) from below. Moreover, it is easy to check that the bounded set of utility signal parameters {(γ12, γ13, γ23) ∈ [−1, 1]³ : 1 + 2γ12γ13γ23 − γ12² − γ13² − γ23² > 0} is entirely contained in the half space {(γ12, γ13, γ23) ∈ [−1, 1]³ : 3 + 2γ12 + 2γ13 + 2γ23 ≥ 0}. Therefore the expression (3 + 2γ12 + 2γ13 + 2γ23) in the denominator is positive and bounded, so the entire denominator is positive and goes to zero as desired.

To prove (ii), note that as t → ∞ we have tVar[m(t)] → ΛΛ′ = (γij) and E[m(t)] → µ. Since 1 ∼ 2 ∼ 3 we have µ1 = µ2 = µ3, and therefore ρt(i, {1, 2, 3}) tends, as t → ∞, to the expression for ρ̄0(i, {1, 2, 3}) given in Proposition 7. In other words, in this knife-edge case in which the µ parameters of all three alternatives are identical, the Bayesian Probit Process behaves, in the limit as t goes to infinity, exactly as the naïve probit behaves at time zero. The result then follows from Proposition 7.


Proof of Theorem 5

Let ε > 0 and recall that m(t) is the (random) vector of posterior mean beliefs at time t, which has a joint normal distribution. The probability that alternative 1 is chosen at time t is equal to the probability that m1(t) > m2(t) and m1(t) > m3(t). This happens if and only if the bi-dimensional vector (m2(t) − m1(t), m3(t) − m1(t)) has negative coordinates. This vector has a joint normal distribution with mean given by

E[m2(t) − m1(t)] = s²t[ µ3(1 + s²t + γ12)(γ13 − γ23)
 + µ1( (−1 − s²t)(1 + s²t + γ12) + γ13γ23 + γ23² )
 + µ2( 1 + γ12 + s²t(2 + s²t + γ12) − γ13(γ13 + γ23) ) ]
 / [ (1 + s²t)( 1 + s²t(2 + s²t) − γ12² − γ13² − γ23² ) + 2γ12γ13γ23 ]

and

E[m3(t) − m1(t)] = s²t[ µ2(1 + s²t + γ13)(γ12 − γ23)
 + µ1( (−1 − s²t)(1 + s²t + γ13) + γ12γ23 + γ23² )
 + µ3( 1 + γ13 + s²t(2 + s²t + γ13) − γ12(γ12 + γ23) ) ]
 / [ (1 + s²t)( 1 + s²t(2 + s²t) − γ12² − γ13² − γ23² ) + 2γ12γ13γ23 ]

The denominator in both expressions can be written as

(1 + s²t)( 1 + s²t(2 + s²t) − γ12² − γ13² − γ23² ) + 2γ12γ13γ23
 = s²t( s²t(3 + s²t) + 3 − γ12² − γ13² − γ23² ) + 1 + 2γ12γ13γ23 − γ12² − γ13² − γ23²

which is clearly positive since s, t > 0, γij² < 1 and 1 + 2γ12γ13γ23 − γ12² − γ13² − γ23² is the positive determinant of the covariance matrix of X(t).

In both numerators, the expression multiplying the coefficient µ3 is positive. In the first case, note that 1 + s²t + γ12 > 0 and that γ13 > γ23 since the pair (1, 3) is more similar than the pair (2, 3). In the second case, the expression multiplying µ3 can be written as

[1 − γ12γ23] + [2s²t + s⁴t²] + [γ13(1 + s²t) − γ12²]

where each expression in brackets is positive. Therefore, for fixed t, both coordinates of the mean vector (E[m2(t) − m1(t)], E[m3(t) − m1(t)]) can be made arbitrarily negative by taking µ3 negative and sufficiently large in absolute value.


The covariance matrix Var[m(t)] = ts(t)(ΛΛ′)⁻¹s(t)′ does not depend on µ. Since µ1 = µ2, both ρt(1, {1, 2, 3}) and ρt(2, {1, 2, 3}) converge to 1/2 as t goes to infinity. Note that, while increasing the absolute value of the negative parameter µ3 does not change Var[m(t)] for any t, it decreases both E[m2(t) − m1(t)] and E[m3(t) − m1(t)] for every t > 0 and therefore increases ρt(1, {1, 2, 3}) for every t > 0. Moreover, for fixed t > 0, ρt(1, {1, 2, 3}) can be made arbitrarily close to 1 by taking µ3 sufficiently negative. This guarantees that the attraction effect can start arbitrarily early in the random choice process. Moreover, since E[m2(t) − m1(t)] above converges to zero from below as t goes to infinity, ρt(1, {1, 2, 3}) converges to 1/2 from above, while ρt(2, {1, 2, 3}) converges to 1/2 from below.
