
Psychological Review, 1991, Vol. 98, No. 4, 506-528

Copyright 1991 by the American Psychological Association, Inc. 0033-295X/91/$3.00

Probabilistic Mental Models: A Brunswikian Theory of Confidence

Gerd Gigerenzer
Center for Advanced Study in the Behavioral Sciences, Stanford, California

Ulrich Hoffrage and Heinz Kleinbölting
University of Salzburg, Salzburg, Austria

Research on people's confidence in their general knowledge has to date produced two fairly stable effects, many inconsistent results, and no comprehensive theory. We propose such a comprehensive framework, the theory of probabilistic mental models (PMM theory). The theory (a) explains both the overconfidence effect (mean confidence is higher than percentage of answers correct) and the hard-easy effect (overconfidence increases with item difficulty) reported in the literature and (b) predicts conditions under which both effects appear, disappear, or invert. In addition, (c) it predicts a new phenomenon, the confidence-frequency effect, a systematic difference between a judgment of confidence in a single event (i.e., that any given answer is correct) and a judgment of the frequency of correct answers in the long run. Two experiments are reported that support PMM theory by confirming these predictions, and several apparent anomalies reported in the literature are explained and integrated into the present framework.

Do people think they know more than they really do? In the last 15 years, cognitive psychologists have amassed a large and apparently damning body of experimental evidence on overconfidence in knowledge, evidence that is in turn part of an even larger and more damning literature on so-called cognitive biases. The cognitive bias research claims that people are naturally prone to making mistakes in reasoning and memory, including the mistake of overestimating their knowledge. In this article, we propose a new theoretical model for confidence in knowledge based on the more charitable assumption that people are good judges of the reliability of their knowledge, provided that the knowledge is representatively sampled from a specified reference class. We claim that this model both predicts new experimental results (that we have tested) and explains a wide range of extant experimental findings on confidence, including some perplexing inconsistencies.

This article was written while Gerd Gigerenzer was a fellow at the Center for Advanced Study in the Behavioral Sciences, Stanford, California. We are grateful for financial support provided by the Spencer Foundation and the Deutsche Forschungsgemeinschaft (DFG 170/2-1). We thank Leda Cosmides, Lorraine Daston, Baruch Fischhoff, Jennifer Freyd, Kenneth Hammond, Wolfgang Hell, Sarah Lichtenstein, Kathleen Much, John Tooby, Amos Tversky, and an anonymous reviewer for helpful comments on earlier versions of this article. Correspondence concerning this article should be addressed to Gerd Gigerenzer, who is now at the Institut für Psychologie, Hellbrunnerstrasse 34, Universität Salzburg, A-5020 Salzburg, Austria.

Moreover, it is the first theoretical framework to integrate the two most striking and stable effects that have emerged from confidence studies—the overconfidence effect and the hard-easy effect—and to specify the conditions under which these effects can be made to appear, disappear, and even invert. In most recent studies (including our own, reported herein), subjects are asked to choose between two alternatives for each of a series of general-knowledge questions. Here is a typical example: "Which city has more inhabitants? (a) Hyderabad or (b) Islamabad." Subjects choose what they believe to be the correct answer and then are directed to specify their degree of confidence (usually on a 50%-100% scale) that their answer is indeed correct. After the subjects answer many questions of this sort, the responses are sorted by confidence level, and the relative frequencies of correct answers in each confidence category are calculated. The overconfidence effect occurs when the confidence judgments are larger than the relative frequencies of the correct answers; the hard-easy effect occurs when the degree of overconfidence increases with the difficulty of the questions, where the difficulty is measured by the percentage of correct answers.

Both effects seem to be stable. Fischhoff (1982) reviewed the attempts to eliminate overconfidence by numerous "debiasing methods," such as giving rewards, clarifying instructions, warning subjects in advance about the problem, and using better response modes—all to no avail. He concluded that these manipulations "have so far proven relatively ineffective," and that overconfidence was "moderately robust" (p. 440). von Winterfeldt and Edwards (1986, p. 539) agreed that "overconfidence is a reliable, reproducible finding." Yet these robust phenomena still await a theory. In particular, we lack a comprehensive theoretical framework that explains both phenomena, as well as the various exceptions reported in the literature, and integrates the several local explanatory attempts already advanced. That is the aim of this article. It consists of four parts: (a) an exposition of the proposed theory of probabilistic mental models (PMM theory), including predictions of new experimental findings based on the theory; (b) a report of our experimental tests confirming these predictions; (c) an explanation of apparent anomalies in previous experimental results, by means of PMMs; and (d) a concluding discussion.



PMM Theory

This theory deals with spontaneous confidence—that is, with an immediate reaction, not the product of long reflection. Figure 1 shows a flow chart of the processes that generate confidence judgments in two-alternative general-knowledge tasks.¹ There are two strategies. When presented with a two-alternative confidence task, the subject first attempts to construct what we call a local mental model (local MM) of the task. This is a solution by memory and elementary logical operations. If this fails, a PMM is constructed that goes beyond the structure of the task in using probabilistic information from a natural environment.

For convenience, we illustrate the theory using a problem from the following experiments: "Which city has more inhabitants? (a) Heidelberg or (b) Bonn." As explained earlier, the subjects' task is to choose a or b and to give a numerical judgment of their confidence (that the answer chosen is correct).

Local MM

We assume that the mind first attempts a direct solution that could generate certain knowledge by constructing a local MM. For instance, a subject may recall from memory that Heidelberg has a population between 100,000 and 200,000, whereas Bonn has more than 290,000 inhabitants. This is already sufficient for the answer "Bonn" and a confidence judgment of 100%. In general, a local MM can be successfully constructed if (a) precise figures can be retrieved from memory for both alternatives, (b) intervals that do not overlap can be retrieved, or (c) elementary logical operations, such as the method of exclusion, can compensate for missing knowledge. Figure 2 illustrates a successful local MM for the previous example. Now consider a task where the target variable is not quantitative (such as the number of inhabitants) but is qualitative: "If you see the nationality letter P on a car, is it from Poland or Portugal?" Here, either direct memory about the correct answer or the method of exclusion is sufficient to construct a local MM. The latter is illustrated by a subject reasoning "Since I know that Poland has PL it must be Portugal" (Allwood & Montgomery, 1987, p. 370).

The structure of the task must be examined to define more generally what is referred to as a local MM. The task consists of two objects, a and b (alternatives), and a target variable t. First, a local MM of this task is local; that is, only the two alternatives are taken into account, and no reference class of objects is constructed (see the following discussion). Second, it is direct; that is, it contains only the target variable (e.g., number of inhabitants), and no probability cues are used. Third, no inferences besides elementary operations of deductive logic (such as exclusion) occur. Finally, if the search is successful, the confidence in the knowledge produced is evaluated as certain. In these respects, our concept of a local MM is similar to what Johnson-Laird (1983, pp. 134-142) called a "mental model" in syllogistic inference.

A local MM simply matches the structure of the task; there is no use of the probability structure of an environment and, consequently, no frame for inductive inference as in a PMM. Because memory can fail, the "certain" knowledge produced can sometimes be incorrect. These failures contribute to the amount of overconfidence to be found in 100%-confident judgments.
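To make the local MM concrete, here is a minimal Python sketch (ours, added for illustration; it is not part of the original article, and the population intervals are hypothetical) of the interval check just described: if the recalled ranges for the two alternatives do not overlap, the answer follows directly from memory with 100% confidence; otherwise the local MM fails.

# Minimal sketch (not from the article) of a local MM interval check.
# Each interval is a (low, high) population range recalled from memory,
# or None if nothing can be retrieved for that alternative.
def local_mm(interval_a, interval_b):
    if interval_a is None or interval_b is None:
        return None                          # memory retrieval failed
    low_a, high_a = interval_a
    low_b, high_b = interval_b
    if low_a > high_b:
        return ("a", 1.0)                    # a is certainly larger: 100% confidence
    if low_b > high_a:
        return ("b", 1.0)                    # b is certainly larger: 100% confidence
    return None                              # overlapping intervals: no certain answer

# Heidelberg (100,000-200,000) vs. Bonn (more than 290,000; upper bound invented):
print(local_mm((100_000, 200_000), (290_000, 330_000)))   # -> ('b', 1.0)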

PMM

Local MMs are of limited success in general-knowledge tasks² and in most natural environments, although they seem to be sufficient for solving some syllogisms and other problems of deductive logic (see Johnson-Laird, 1983). If no local MM can be activated, it is assumed that a PMM is constructed next. A PMM solves the task by inductive inference, and it does so by putting the specific task into a larger context. A PMM connects the specific structure of the task with a probability structure of a corresponding natural environment (stored in long-term memory). In our example, a natural environment could be the class of all cities in Germany with a set of variables defined on this class, such as the number of inhabitants. This task selects the number of inhabitants as the target and the variables that covary with this target as the cues.

A PMM is different from a local MM in several respects. First, it contains a reference class of objects that includes the objects a and b. Second, it uses a network of variables in addition to the target variable for indirect inference. Thus, it is neither local nor direct. These two features also change the third and fourth aspects of a local MM. Probabilistic inference is part of the cognitive process, and uncertainty is part of the outcome.

Reference Class

We use Brunswik's (1943, p. 257) term reference class to define the class of objects or events that a PMM contains. In our example, the reference class "all cities in Germany" may be generated. To generate a reference class means to generate a set of objects known from a person's natural environment that contains objects a and b.

The reference class determines which cues can function as probability cues for the target variable and what their cue validities are. For instance, a valid cue in the reference class "all cities in Germany" would be the soccer-team cue; that is, whether a city's soccer team plays in the German soccer Bundesliga, in which the 18 best teams compete. Cities with more inhabitants are more likely to have a team in the Bundesliga. The soccer-team cue would not help in the Hyderabad-Islamabad task, which must be solved by a PMM containing a different reference class with different cues and cue validities.

Probability Cues

A PMM for a given task contains a reference class, a target variable, probability cues, and cue validities. A variable is a probability cue Cᵢ (for a target variable in a reference class R) if the probability p(a) of a being correct is different from the conditional probability of a being correct, given that the values of a and b differ on Cᵢ. If the cue is a binary variable such as the soccer-team cue, this condition can be stated as follows:

p(a) ≠ p(a | aCᵢb; R),

where aCᵢb signifies the relation of a and b on the cue Cᵢ (e.g., a has a soccer team in the Bundesliga, but b does not) and p(a | aCᵢb; R) is the cue validity of Cᵢ in R.

Figure 1. Cognitive processes in solving a two-alternative general-knowledge task. (MM = mental model; PMM = probabilistic mental model.) [The flow chart relates the structure of the task (objects a and b, target variable t), the mental models of the task (local MM with retrieval and logical operations vs. PMM with reference class, cues, and cue validities), and the structure of the perceived environment (reference class R, cues, ecological validities), connected by a cue generation and testing cycle.]

¹ For convenience, the theory is presented here in its complete form, although parts of it were developed after Experiment 1 was performed. All those parts were subjected to an independent test in Experiment 2.
² Allwood and Montgomery (1987, pp. 369-370) estimated from verbal protocols that about 19% of their general-knowledge questions were solved by "full recognition," which seems to be equivalent to memory and elementary logical operations only.

Figure 2. Local mental model of a two-alternative general-knowledge task. [The diagram shows nonoverlapping population intervals for a and b on the target dimension (number of inhabitants), leading to the response "choose b; say 100% confident."]

Thus, cue validities are thought of as conditional probabilities, following Rosch (1978) rather than Brunswik (1955), who defined his "cue utilizations" as Pearson correlations. Conditional probabilities need not be symmetric as correlations are. This allows the cue to be a better predictor for the target than the target is for the cue, or vice versa. Cue validity is a concept in the PMM, whereas the corresponding concept in the environment is ecological validity (Brunswik, 1955), which is the true relative frequency of any city having more inhabitants than any other one in R if aCᵢb. For example, consider the reference class all cities in Germany with more than 100,000 inhabitants. The ecological validity of the soccer-team cue here is .91 (calculated for 1988/1989 for what then was West Germany). That is, if one checked all pairs in which one city a has a team in the Bundesliga but the other city b does not, one would find that in 91% of these cases city a has more inhabitants.
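As an illustration of how an ecological validity of this kind can be computed, the following Python sketch (ours, not the authors'; the three-city reference class and its values are invented) counts, over all pairs in which exactly one city has the binary cue, how often the city with the cue is also the more populous one.

from itertools import combinations

def ecological_validity(cities):
    # cities: list of (population, has_cue) pairs for the reference class R
    favorable = activated = 0
    for (pop_a, cue_a), (pop_b, cue_b) in combinations(cities, 2):
        if cue_a == cue_b:
            continue                          # cue cannot discriminate this pair
        activated += 1
        if (pop_a > pop_b) == cue_a:          # the larger city is the one with the cue
            favorable += 1
    return favorable / activated

# Hypothetical three-city reference class (population, has a Bundesliga team):
example_R = [(600_000, True), (250_000, False), (130_000, False)]
print(ecological_validity(example_R))         # -> 1.0 for this toy example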

Vicarious Functioning

Probability cues are generated, tested, and, if possible, activated. We assume that the order in which cues are generated is not random; in particular, we assume that the order reflects the hierarchy of cue validities. For the reference class all cities in Germany, the following cues are examples that can be generated: (a) the soccer-team cue; (b) whether one city is a state capital and the other is not (state capital cue); (c) whether one city is located in the Ruhrgebiet, the industrial center of Germany, and the other in largely rural Bavaria (industrial cue); (d) whether the letter code that identifies a city on a license plate is shorter for one city than for the other (large cities are usually abbreviated by only one letter, smaller cities by two or three; license plate cue); and (e) whether one has heard of one city and not of the other (familiarity cue). Consider now the Heidelberg-Bonn problem again. The first probability cue is generated and tested to see whether it can be activated for that problem. Because neither of the two cities has a team in the Bundesliga, the first cue does not work.

In general, with a binary cue and the possibility that the subject has no knowledge, there are nine possibilities (see Figure 3). In only two of these can a cue be activated. In all other cases, the cue is useless (although one could further distinguish between the four known-unknown cases and the three remaining cases). If a cue cannot be activated, then a further cue is generated and tested. In the Heidelberg-Bonn task, none of the five cues cited earlier can in fact be activated. Finally, one cue may be generated that can be activated, such as whether one city is the capital of the country and the other is not (capital cue). This cue has a small probability of being activated—a small activation rate in R (because it applies only to pairs that include Bonn)—and it does not have a particularly high cue validity in R because it is well known that Bonn is not exactly London or Paris.

Figure 3. Two conditions in which a cue can be activated. [A 3 × 3 table crosses the possible cue values for city a (yes, no, unknown) with those for city b; only the two cells in which one city is known to have the cue and the other is known not to have it allow activation.]

The Heidelberg-Bonn problem illustrates that probability cues may have small activation rates in R, and as a consequence, several cues may have to be generated and tested before one is found that can be activated. The capital cue that can be activated for the Heidelberg-Bonn comparison may fail for the next problem, for instance a Heidelberg-Göttingen comparison. Cues can substitute for one another from problem to problem, a process that Brunswik (1955) called "vicarious functioning."

End of Cue Generation and Testing Cycle

If (a) the number of problems is large or other kinds of time pressure apply and (b) the activation rate of cues is rather small, then one can assume that the cue generation and testing cycle ends after the first cue that can be activated has been found. Both conditions seem to be typical for general-knowledge questions. For instance, even when subjects were explicitly instructed to produce all possible reasons for and against each alternative, they generated only about three on the average and four at most (Koriat, Lichtenstein, & Fischhoff, 1980). If no cue can be activated, we assume that choice is made randomly, and "confidence 50%" is chosen.

Choice of Answer and Confidence Judgment

Choice of answer and confidence judgment are determined by the cue validity. Choice follows the rule:

choose a if p(a | aCᵢb; R) > p(b | aCᵢb; R).

If a is chosen, the confidence that a is correct is given by the cue validity:

p(a | aCᵢb; R).

Note that the assumption that confidence equals cue validity is not arbitrary; it is both rational and simple in the sense that good calibration is to be expected if cue validities correspond to ecological validities. This holds true even if only one cue is activated.

Thus, choice and confidence are inferred from the same activated cue. Both are expressions of the same conditional probability. Therefore, they need not be generated in the temporal sequence choice followed by confidence. The latter is, of course, typical for actual judgments and often enforced by the instructions in confidence studies.
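The cue generation and testing cycle, the choice rule, and the confidence rule can be summarized in a short Python sketch. This is our illustration, not the authors' implementation; the cue hierarchy, the validities other than the .91 figure quoted earlier (which is an ecological validity, not necessarily a subjective cue validity), and the knowledge entries are all assumptions.

import random

def pmm_choice_and_confidence(a, b, cues, knowledge):
    # cues: (name, cue validity) pairs, ordered from highest to lowest validity
    # knowledge: dict mapping (city, cue name) -> True/False; missing means unknown
    for cue, validity in cues:
        val_a = knowledge.get((a, cue))
        val_b = knowledge.get((b, cue))
        if val_a is None or val_b is None or val_a == val_b:
            continue                           # cue cannot be activated; generate the next one
        choice = a if val_a else b             # pick the alternative with the positive cue value
        return choice, validity                # confidence equals the cue validity
    return random.choice([a, b]), 0.5          # no cue activated: guess with 50% confidence

# Illustrative cue hierarchy and knowledge (all entries invented for this sketch):
cues = [("soccer team", 0.91), ("state capital", 0.80), ("capital of country", 0.70)]
knowledge = {("Heidelberg", "soccer team"): False, ("Bonn", "soccer team"): False,
             ("Heidelberg", "capital of country"): False, ("Bonn", "capital of country"): True}
print(pmm_choice_and_confidence("Heidelberg", "Bonn", cues, knowledge))  # -> ('Bonn', 0.7)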

Confidence in the Long Run and Confidence in Single Events

Until now, only confidence in single events—such as the answer "Bonn" is correct—has been discussed. Confidence in one's knowledge can also be expressed with respect to sequences of answers or events, such as "How many of the last 50 questions do you think you answered correctly?" This distinction is parallel to that between probabilities of single events and relative frequencies in the long run—a distinction that is fundamental to all discussions on the meaning of probability (see Gigerenzer et al., 1989). Probabilities of single events (confidences) and relative frequencies are not the same for many schools of probability, and we argue that they are not evaluated by the same cognitive processes either.

Consider judgments of frequency. General-knowledge tasks that involve a judgment of the frequency of correct answers (frequency tasks) can rarely be answered by constructing a local MM. The structure of the task contains one sequence of N questions and answers, and the number of correct answers is the target variable. Only limiting cases, such as small N (i.e., if only a few questions are asked) combined with the belief that all answers were correct, may allow one to solve this task by a local MM. Again, to construct a local MM of the task means that the mental model consists of only the local sequence of total N answers (no reference class), and because one attempts to solve the task by direct access to memory about the target variable, no network of probability cues is constructed.

Similarly, a PMM of a frequency task is different from a PMM of a confidence task. A confidence task about city size in Germany has "cities in Germany" as a reference class; however, a task that involves judgments of frequencies of correct answers in a series of N questions about city size has a different reference class: Its reference class will contain series of similar questions in similar testing situations. Because the target variable also differs (number of correct answers instead of number of inhabitants), the PMM of a frequency task will also contain different cues and cue validities. For instance, base rates of performance in earlier general-knowledge or similar testing situations could serve as a probability cue for the target variable. Again, our basic assumption is that a PMM connects the structure of the task with a known structure of the subject's environment.

Table 1 summarizes the differences between PMMs that are implied by the two different tasks. Note that in our account, both confidences in a single event and judgments of frequency are explained by reference to experienced frequencies. However, these frequencies relate to different target variables and reference classes. We use this assumption to predict systematic differences between these kinds of judgments.

Adaptive PMMs and Representative Sampling

A PMM is an inductive device that uses the "normal" life conditions in known environments as the basis for induction. How well does the structure of probability cues defined on R in a PMM represent the actual structure of probability cues in the environment? This question is also known as that of "proper cognitive adjustment" (Brunswik, 1964, p. 22). If the hierarchy of cues and their validities corresponds to that of the ecological validities, then the PMM is well adapted to a known environment. In Brunswik's view, cue validities are learned by observing the frequencies of co-occurrences in an environment.

A large literature exists that suggests that (a) memory is often (but not always) excellent in storing frequency information from various environments and (b) the registering of event occurrences for frequency judgments is a fairly automatic cognitive process requiring very little attention or conscious effort (e.g., Gigerenzer, 1984; Hasher, Goldstein, & Toppino, 1977; Howell & Burnett, 1978; Zacks, Hasher, & Sanft, 1982). Hasher and Zacks (1979) concluded that frequency of occurrence, spatial location, time, and word meaning are among the few aspects of the environment that are encoded automatically and that encoding of frequency information is "automatic at least in part because of innate factors" (p. 360). In addition, Hintzman, Nozawa, and Irmscher (1982) proposed that frequencies are stored in memory in a nonnumerical analog mode.

Whatever the mechanism of frequency encoding, we use the following assumption for deriving our predictions: If subjects had repeated experience with a reference class, a target variable, and cues in their environment, we assume that cue validities correspond well to ecological validities. (This holds true for the average in a group of subjects, but individual idiosyncrasies in learning the frequency structure of the environment may occur.) This is a bold assumption made in ignorance of potential deviations between specific cue validities and ecological validities. If such deviations existed and were known, predictions by PMM theory could be improved. The assumption, however, derives support from both the literature on automatic frequency processing and a large body of neo-Brunswikian research on the correspondence between ecological validities and cue utilization (the latter of which corresponds to our cue validities; e.g., Arkes & Hammond, 1986; K. Armelius, 1979; Brehmer & Joyce, 1988; MacGregor & Slovic, 1986).

Note that this adaptiveness assumption does not preclude that individuals (as well as the average subject) err. Errors can occur even if a PMM is highly adapted to a given environment. For instance, if an environment is changing or is changed in the laboratory by an experimenter, an otherwise well-adapted PMM may be suboptimal in a predictable way.

Table 1
Probabilistic Mental Models for Confidence Task Versus Frequency Task: Differences Between Target Variables, Reference Classes, and Probability Cues

PMM                Confidence task                   Frequency task
Target variable    Number of inhabitants             Number of correct answers
Reference class    Cities in Germany                 Sets of general-knowledge questions in similar testing situations
Probability cues   For example, soccer-team cue      For example, base rates of previous performance or
                   or state capital cue              average confidence in N answers

Note. For illustration, questions of the Heidelberg-Bonn type are used. PMM = probabilistic mental model.


Brunswik's notion of "representative sampling" is important here. If a person experienced a representative sample of objects from a reference class, one can expect his or her PMM to be better adapted to an environment than if he or she happened to experience a skewed, unrepresentative sample.

Representative sampling is also important in understanding the relation between a PMM and the task. If a PMM is well adapted, but the set of objects used in the task (questions) is not representative of the reference class in the environment, performance in tasks will be systematically suboptimal.

To avoid confusion with terms such as calibration, we will use the term adaptation only when we are referring to the relation between a PMM and a corresponding environment—not, however, for the relation between a PMM and a task.

Predictions

A concrete example can help motivate our first prediction. Two of our colleagues, K and O, are eminent wine tasters. K likes to make a gift of a bottle of wine from his cellar to friend O, on the condition that O guesses what country or region the grapes were grown in. Because O knows the relevant cues, O can usually pick a region with some confidence. O also knows that K sometimes selects a quite untypical exemplar from his ample wine cellar to test friend O's limits. Thus, for each individual wine, O can infer the probability that the grapes ripened in, say, Portugal as opposed to South Africa, with considerable confidence from his knowledge about cues. In the long run, however, O nevertheless expects the relative frequency of correct answers to be lower because K occasionally selects unusual items.

Consider tests of general knowledge, which share an important feature with the wine-tasting situation: Questions are selected to be somewhat difficult and sometimes misleading. This practice is common and quite reasonable for testing people's limits, as in the wine-tasting situation. Indeed, there is apparently not a single study on confidence in knowledge where a reference class has been defined and a representative (or random) sample of general-knowledge questions has been drawn from this population. For instance, consider the reference class "metropolis" and the geographical north-south location as the target variable. A question like "Which city is farther north? (a) New York or (b) Rome" is likely to appear in a general-knowledge test (almost everyone gets it wrong), whereas a comparison between Berlin and Rome is not.

The crucial point is that confidence and frequency judgments refer to different kinds of reference classes. A set of questions can be representative with respect to one reference class and, at the same time, selected with respect to the other class. Thus, a set of 50 general-knowledge questions of the city type may be representative for the reference class "sets of general-knowledge questions" but not for the reference class "cities in Germany" (because city pairs have been selected for being difficult or misleading). Asking for a confidence judgment summons up a PMM on the basis of the reference class "cities in Germany"; asking for a frequency judgment summons up a PMM on the basis of the reference class "sets of general-knowledge questions." The first prediction can now be stated.

1. Typical general-knowledge tasks elicit both overconfidence and accurate frequency judgments. By "typical" general-knowledge tasks we refer to a set of questions that is representative for the reference class "sets of general-knowledge questions."

This prediction is derived in the following way: If (a) PMMs for confidence tasks are well adapted to an environment containing a reference class R (e.g., all cities in Germany) and (b) the actual set of questions is not representative for R, but selected for difficult pairs of cities, then confidence judgments exhibit overconfidence. Condition A is part of our theory (the simplifying assumption we just made), and Condition B is typical for the general-knowledge questions used in studies on confidence as well as in other testing situations.

If (a) PMMs for frequency-of-correct-answers tasks are well adapted with respect to an environment containing a reference class R′ (e.g., the set of all general-knowledge tests experienced earlier), and (b) the actual set of questions is representative for R′, then frequency judgments are expected to be accurate. Again, Condition A is part of our theory, and Condition B will be realized in our experiments by using a typical set of general-knowledge questions.

Taken together, the prediction is that the same person will exhibit overconfidence when asked for the confidence that a particular answer is correct and accurate estimates when asked for a judgment of frequency of correct answers. This prediction is shown by the two points on the left side of Figure 4. This prediction cannot be derived from any of the previous accounts of overconfidence.

To introduce the second prediction, we return to the wine-tasting story. Assume that K changes his habit of selecting unusual wines from his wine cellar and instead buys a representative sample of French red wines and lets O guess from what region they come. However, K does not tell O about the new sampling technique. O's average confidence judgments will now be close to the proportion of correct answers. In the long run, O nevertheless expects the proportion of correct answers to be less, still assuming the familiar testing situation in which wines were selected, not randomly sampled. Thus, O's frequency judgments will show underestimation.

Figure 4. Predicted differences between confidence and frequency judgments (confidence-frequency effect). [Over- and underestimation are plotted for confidence and frequency judgments in a selected and a representative set.]

Consider now a set of general-knowledge questions that is a random sample from a defined reference class in the subject's natural environment. We use the term natural environment to denote a knowledge domain familiar to the subjects participating in the study. This is a necessary (although not sufficient) condition to assume that PMMs are, on the average, well adapted. In the experiments reported herein, we used West German subjects and the reference class "all cities with more than 100,000 inhabitants in West Germany." (The study was conducted before the unification of Germany.) The second prediction is about this situation:

2. If the set of general-knowledge tasks is randomly sampled from a natural environment, we expect overconfidence to be zero, but frequency judgments to exhibit underestimation. Derivation is as before: If PMMs for confidence tasks are well adapted with respect to R, and the actual set of questions is a representative sample from R, then overconfidence is expected to disappear. If PMMs for frequency-of-correct-answers tasks are well adapted with respect to R′, and the actual set of questions is not representative for R′, then frequency judgments are expected to be underestimations of true frequencies.

Again, this prediction cannot be derived from earlier accounts. Figure 4 shows Predictions 1 and 2. The predicted difference between confidence and frequency judgments is referred to as the confidence-frequency effect.

Testing these predictions also allows for testing the assumption of well-adapted PMMs for the confidence task. Assume that PMMs are not well adapted. Then a representative sample of city questions should not generate zero overconfidence but rather over- or underconfidence, depending on whether cue validities overestimate or underestimate ecological validities. Similarly, if PMMs for frequency judgments are not well adapted, frequency judgments should deviate from true frequencies in typical general-knowledge tasks. Independent of the degree of adaptation, however, the confidence-frequency effect should emerge, but the curves in Figure 4 would be transposed upward or downward.

We turn now to the standard way in which overconfidence has been demonstrated in previous research, comparing confidence levels with relative frequencies of correct answers at each confidence level. This standard comparison runs into a conceptual problem well known in probability theory and statistics: A discrepancy between subjective probabilities in single events (i.e., the confidence that a particular answer is correct) and relative frequencies in the long run is not a bias in the sense of a violation of probability theory, as is clear from several points of view within probability theory. For instance, for a frequentist such as Richard von Mises (1928/1957), probability theory is about frequencies (in the long run), not about single events. According to this view, the common interpretation of overconfidence as a bias is based on comparing apples with oranges. What if that conceptual problem is avoided and, instead, the relative frequency of correct answers in each confidence category is compared with the estimated relative frequency in each confidence category? PMM theory makes an interesting prediction for this situation, following the same reasoning as for the frequency judgments in Predictions 1 and 2 (which were estimated frequencies of correct answers in a series of N questions, whereas estimated relative frequencies in each confidence category are the concern here):

3. Comparing estimated relative frequencies with true relative frequencies of correct answers makes overestimation disappear. More precisely, if the set of general-knowledge questions is selected, over- or underestimation is expected to be zero; if the set is randomly sampled, underestimation is expected. Thus, PMM theory predicts that the distinction between confidence and relative frequency is psychologically real, in the sense that subjects do not believe that a confidence judgment of X% implies a relative frequency of X%, and vice versa. We know of no study on overconfidence that has investigated this issue. Most have assumed instead that there is, psychologically, no difference.

Prediction 4 concerns the hard-easy effect, which says that overconfidence increases when questions get more difficult (e.g., Lichtenstein & Fischhoff, 1977). The effect refers to confidence judgments only, not to frequency judgments. On our account, the hard-easy effect is not simply a function of difficulty. Rather, it is a function of difficulty and a separate dimension, selected versus representative sampling. (Note that the terms hard and easy refer to the relative difficulty of two samples of items, whereas the terms selected and representative refer to the relation between one sample and a reference class in the person's environment.) PMM theory specifies conditions under which the hard-easy effect occurs, disappears, and is reversed. A reversed hard-easy effect means that overconfidence decreases when questions are more difficult.

In Figure 5, the line descending from H to E represents a hard-easy effect: Overconfidence in the hard set is larger than in the easy set. The important distinction (in addition to hard vs. easy) is whether a set was obtained by representative sampling or was selected. For instance, assume that PMMs are well adapted and that two sets of tasks differing in percentage correct (i.e., in difficulty) are both representative samples from their respective reference classes. In this case, one would expect all points to be on the horizontal zero-overconfidence line in Figure 5 and the hard-easy effect to be zero. More generally:

4. If two sets, hard and easy, are generated by the same sampling process (representative sampling or same deviation from representative), the hard-easy effect is expected to be zero. If sampling deviates in both the hard and the easy set equally from representative sampling, points will lie on a horizontal line parallel to the zero-overconfidence line.

Now consider the case that the easy set is selected from a corresponding reference class (e.g., general-knowledge questions), but the hard set is a representative sample from another reference class (denoted as H in Figure 5). One then would predict a reversal of the hard-easy effect, as illustrated in Figure 5 by the double line from E to H.

5. If there are two sets, one is a representative sample from a reference class in a natural environment, the other is selected from another reference class for being difficult, but the representative set is harder than the selected set; then the hard-easy effect is reversed.

Figure 5. Predicted reversal of the hard-easy effect. (H = hard; E = easy.) [Over/underconfidence is plotted against percentage correct for a hard and an easy set, with the zero-overconfidence line at .00.]

In the next section, Predictions 1, 2, and 3 are tested in two experiments; in the Explaining Anomalies in the Literature section, Predictions 4 and 5 are checked against results in the literature.

Experiment 1

Method

Two sets of questions were used, which we refer to as the representative and the selected set. The representative set was determined in the following way. We used as a reference class in a natural environment (an environment known to our subjects) the set of all cities in West Germany with more than 100,000 inhabitants. There were 65 cities (Statistisches Bundesamt, 1986). From this reference class, a random sample of 25 cities was drawn, and all pairs of cities in the random sample were used in a complete paired comparison to give 300 pairs. No selection occurred. The target variable was the number of inhabitants, and the 300 questions were of the following kind: "Which city has more inhabitants? (a) Solingen or (b) Heidelberg." We chose city questions for two reasons. First, and most important, this content domain allowed for a precise definition of a reference class in a natural environment and for random sampling from this reference class. The second reason was for comparability. City questions have been used in earlier studies on overconfidence (e.g., Keren, 1988; May, 1987).
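For illustration, the sampling scheme just described (a random sample of 25 cities from the 65-city reference class, followed by a complete paired comparison) can be sketched in a few lines of Python; the placeholder names stand in for the actual cities.

import random
from itertools import combinations

reference_class = [f"city_{i}" for i in range(1, 66)]   # stands in for the 65 cities
sample = random.sample(reference_class, 25)             # random sample of 25 cities
pairs = list(combinations(sample, 2))                   # complete paired comparison
print(len(pairs))                                        # -> 300 (= 25 * 24 / 2)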

In addition to the representative set, a typical set of general-knowledge questions, as in previous studies, was used. This selected set of 50 general-knowledge questions was taken from an earlier study (Angele et al., 1982). Two examples are "Who was born first? (a) Buddha or (b) Aristotle" and "When was the zipper invented? (a) before 1920 or (b) after 1920."

After each answer, the subject gave a confidence judgment (that this particular answer was correct). Two kinds of frequency judgments were used. First, after each block of 50 questions, the subject estimated the number of correct answers among the 50 answers given. Because there were 350 questions, every subject gave seven estimates of the number of correct answers. Second, after the subjects answered all questions, they were given an enlarged copy of the confidence scale used throughout the experiment and were asked for the following frequency judgment: "How many of the answers that you classified into a certain confidence category are correct? Please indicate for every category your estimated relative frequency of correct answers."

In Experiment 1, we also introduced two of the standard manipulations in the literature. The first was to inform and warn half of our subjects of the overconfidence effect, and the second was to offer half of each group a monetary incentive for good performance. Both are among a list of "debiasing" methods known as being relatively ineffective (Fischhoff, 1982), and both contributed to the view that overconfidence is a robust phenomenon. If PMM theory is correct, the magnitude of effects resulting from the two manipulations—confidence versus frequency judgment and selected versus representative sampling—should be much larger than those resulting from the "debiasing" manipulations.

Subjects. Subjects were 80 students (43 men and 37 women) at the University of Konstanz who were paid for participation. Eighty-five percent of them grew up in the state of Baden-Württemberg, so the group was fairly homogeneous (knowledge about city populations often depends on the rater's geographical location). Subjects were tested in small groups of a maximum of 12 persons.

Design and procedure. This was a 2 × 2 × 2 design with representative-selected set varied within subjects, and information-no information about overconfidence and monetary incentive-no incentive as independent variables varied between subjects. Half of the subjects answered the representative set first; the other half, the selected set. Order of questions was determined randomly in both sets.

The confidence scale consisted of seven categories, 50%, 51%-60%, 61%-70%, 71%-80%, 81%-90%, 91%-99%, and 100% confident. The 50%- and 100%-confidence values were introduced as separate categories because previous research showed that subjects often tend to use these particular values. Subjects were told first to mark the alternative that seemed to be the correct one, and then to indicate with a second cross their confidence that the answer was correct. If they only guessed, they should cross the 50% category; if they were absolutely certain, they should cross the 100% category. We explained that one of the alternatives was always correct. In the information condition, subjects received the following information: "Most earlier studies found a systematic tendency to overestimate one's knowledge; that is, there were many fewer answers correct than one would expect from the confidence ratings given. Please keep this warning in mind." In the incentive condition, subjects were promised 20 German marks (or a bottle of French champagne), in addition to the payment that everyone received (7.50 marks), for the best performance in the group.

To summarize, 350 questions were presented, with a confidence judgment after each question, a frequency judgment after each 50 questions, and a judgment of relative frequencies of correct answers in each confidence category at the end.

For comparison with the literature on calibration, we used the following measure:

over- or underconfidence = (1/n) Σᵢ nᵢ (pᵢ − fᵢ) = p̄ − f̄,

where n is the total number of answers, nᵢ is the number of times the confidence judgment pᵢ was used, and fᵢ is the relative frequency of correct answers for all answers assigned confidence pᵢ; I is the number of different confidence categories used (I = 7), and p̄ and f̄ are the overall mean confidence judgment and percentage correct, respectively. A positive difference is called overconfidence. For convenience, we report over- and underconfidence in percentages (×100).

Results

Prediction 1. PMM theory predicts that in the selected set (general-knowledge questions), people show overestimation in confidence judgments (overconfidence) and, simultaneously, accurate frequency judgments.

The open-circle curve in Figure 6 shows the relation between judgments of confidence and true relative frequency of correct answers in the selected set—that is, the set of mixed general-knowledge questions.³ The relative frequency of correct answers (averaged over all subjects) was 72.4% in the 100%-confidence category, 66.3% in the 95% category, 58.0% in the 85% category, and so on. The curve is far below the diagonal (calibration curve) and similar to the curves reported by Lichtenstein, Fischhoff, and Phillips (1982, Figure 2). It replicates and demonstrates the well-known overconfidence effect. Percentage correct was 52.9, mean confidence was 66.7, and overconfidence was 13.8.

Figure 6. Calibration curves for the selected set (open circles), representative set (black squares), and matched set (open squares). [Percentage correct is plotted against confidence (50%-100%).]

Subjects' frequency judgments, however, are fairly accurate, as Table 2 (last row) shows. Each entry is averaged over the 20 subjects in each condition. For instance, the figure −1.75 means that, on average, subjects in this condition underestimated the true number of correct answers by 1.75. Averaged across the four conditions, we get −1.2, which means that subjects missed the true frequency by an average of only about 1 correct answer in the set of 50 questions. Quite accurate frequency judgments coexist with overconfidence. The magnitude of the confidence-frequency effect found is shown in Figure 7 (left side). PMM theory predicts this systematic difference between confidence and frequency judgments, within the same person and the same general-knowledge questions.

Prediction 2. PMM theory predicts that in the representative set (city questions) people show zero overconfidence and, at the same time, underestimation in frequency judgments.

The solid-square curve in Figure 6 shows the relation between confidence and percentage correct in the representative set—that is, the city questions. For instance, percentage correct in the 100%-confidence category was 90.8%, instead of 72.4%. Overconfidence disappeared (−0.9%). Percentage correct and mean confidence were 71.7 and 70.8, respectively.

Table 2
Mean Differences Between Estimated and True Frequencies of Correct Answers

                      No information-   Incentive   Information   Information
Set                   no incentive      only        only          and incentive
Representative
  1-50                    -9.94          -9.42        -8.80          -8.74
  51-100                  -9.50         -10.37       -11.95         -11.25
  101-150                 -9.88         -10.89       -10.85          -9.90
  151-200                 -6.67          -6.70        -9.35          -5.90
  201-250                 -9.79          -9.84        -7.95          -5.25
  251-300                 -9.47         -10.84        -9.40          -9.05
  Average                 -9.21          -9.68        -9.72          -8.35
Selected                  -1.75          -0.60        -2.65           0.30

Note. Negative signs denote underestimation of true number of correct answers.

³ In Figure 6, we have represented the confidence category (91%-99%) by 95%, and similarly with the other categories. This choice can be criticized because numerical judgments of confidence often cluster around specific values in an interval. (If there is a difference, however, we may expect that it affects the three curves in a similar way, without altering the differences between curves.) In Experiment 2, we used precise values instead of these intervals.

The confidence curve for the representative set is similar to a regression curve for the estimation of relative frequencies by confidence, resulting in underconfidence in the left part of the confidence scale, overconfidence in the right, and zero overconfidence on the average.

Figure 7. Confidence-frequency effect in the representative and selected set. (Solid lines show results of Experiment 1, dotted lines show those of Experiment 2. Frequency judgments are long-run frequencies, N = 50.) [Over- and underestimation are plotted for confidence and frequency judgments in the selected and the representative set.]

Table 2 shows the differences between estimated and true frequencies for each block of 50 items and each of the conditions, respectively. Again, each entry is averaged over the 20 subjects in each condition. For instance, subjects who were given neither information nor incentive underestimated their true number of correct answers by 9.94 (on the average) in the first 50 items of the representative set. Table 2 shows that the values of the mean differences were fairly stable over the six subsets, and, most important, they are, without exception, negative (i.e., underestimation). For each of the 24 cells (representative set), the number of subjects with negative differences (underestimation) was compared with the number of positive differences (overestimation) by sign tests, and all 24 p values were smaller than .01.

The following is an illustration at the individual level: Subject 1 estimated 28, 30, 23, 25, 23, and 23, respectively, for the six subsets, compared with 40, 38, 40, 36, 35, and 32 correct solutions, respectively. An analysis of individual judgments confirmed average results. Among the 80 subjects, 71 underestimated the number of correct answers, whereas only 8 subjects overestimated it (frequency judgments were missing for 1 subject). Incidentally, 7 of these 8 subjects were male. In the selected set, for comparison, 44 subjects underestimated and 35 subjects overestimated the number of correct answers, and 1 subject got it exactly right.

We have attributed the emergence and disappearance of overconfidence to selection versus use of a representative set. One objection to this analysis is that the difference between the open-circle and the solid-square curve in Figure 6 is confounded with a difference in the content of both sets. The selected set includes a broad range of general-knowledge questions, whereas the domain of the representative set (cities) is necessarily more restricted. To check for this possible confound, we determined the item difficulties for each of the 50 general-knowledge questions and selected a subset of 50 city questions that had the same item difficulties. If the difference in Figure 6 is independent of content, but results from the selection process, this "matched" subset of city questions should generate the same calibration curves showing overconfidence as the selected set of general-knowledge questions did. Figure 6 shows that this is the case (open-square curve).


Both content domains produce the same results if questions are selected.

To summarize, in the representative set, overestimation disappears in confidence judgments, and zero overconfidence coexists with frequency judgments that show large underestimation. Results confirm Prediction 2. Figure 7 (right side) shows the magnitude of the confidence-frequency effect found. No previous theory of confidence can predict the results depicted in Figure 7.

Prediction 3. PMM theory predicts that overestimation will disappear if the relative frequencies of correct answers (percentage correct) in each confidence category are compared with the estimated relative frequencies. Because subjects estimated percentage correct for all confidence judgments—that is, including both the selected and the representative set—we expect not only that overestimation will disappear (the prediction from the selected set) but also that it will turn into underestimation (the prediction from the representative set).

The solid line in Figure 8 shows the results for Experiment 1: Estimated relative frequencies are well calibrated and show underestimation in five out of seven confidence categories. Overestimation of one's knowledge disappears. The only exception is the 100%-confidence category. The latter is the confidence category that contains all solutions by local MMs, and errors in memory or elementary logical operations may account for the difference. Figure 8 is a "frequentist" variant of the calibration curve of Figure 6. Here, true percentage correct is compared with estimated percentage correct, rather than with confidence. For instance, in the 100%-confidence category, true and estimated percentage correct were 88.8% and 93.0%, respectively.

Figure 8. Calibration curves for judgments of percentage correct in confidence categories; the horizontal axis gives estimated percentage correct (50%-100%). (Solid lines show results of Experiment 1, dotted lines show those of Experiment 2. Values are averaged across both sets of questions.)

Averaged across experimental conditions, the ratio between estimated frequency in the long run and confidence value is fairly constant, around .87, for confidence ratings between 65% and 95%. It is highest in the extreme categories (see Table 3).

To summarize, subjects explicitly distinguished between confidence in single answers and the relative frequency of correct answers associated with a confidence judgment. This result is implied by PMM theory, according to which different reference classes are cued by confidence and frequency tasks. As stated in Prediction 3, overestimation disappeared. However, the magnitude of underestimation was not, as might be expected, as pronounced as in the frequency judgments dealt with in Predictions 1 and 2. Except for this finding, results conformed well to Prediction 3. Note that no previous theory of confidence in knowledge we are aware of makes this conceptual distinction and that prediction. Our results contradict much of what has been assumed about how the untutored mind understands the relation between confidence and relative frequency of correct answers.

Information about overconfidence and monetary incentive. Mean confidence judgments were indistinguishable between subjects informed about overconfidence and those uninformed. None of seven t tests, one for each confidence category, resulted in p values smaller than .05. If a monetary incentive was announced, overconfidence was more pronounced with incentive than without incentive in five categories (65%-100%) and less in the 50% category (all ps < .05), with an average increase of 3.6%.

The monetary incentive effect resulted from the incentive/no-information group, in which confidence judgments were higher than in all three other groups (but we found the same percentage correct in all groups). One reason for this interaction could be that we did not specify in the instructions a criterion for best performance. If warned of overconfidence, subjects could easily infer that the incentive was for minimizing overconfidence. If not warned, at least some subjects could also have attempted to maximize percentage correct. None of these attempts, however, was successful, consistent with PMM theory and earlier studies (e.g., Fischhoff, Slovic, & Lichtenstein, 1977). The effort to raise the percentage correct seems to have raised confidence instead, an outcome that cannot be accounted for by PMM theory.

Table 3
Estimated and True Percentage Correct in Each Confidence Category (Summarized Over the Representative and the Selected Sets)

Confidence    No. of confidence    Estimated    True         Over-/under-
category      judgments            % correct    % correct    estimation
100                5,166              93.0         88.8           4.2
91-99              1,629              82.7         81.6           1.1
81-90              2,534              73.1         74.6          -1.5
71-80              2,950              64.3         70.1          -5.8
61-70              3,506              57.3         65.6          -8.3
51-60              4,036              53.7         63.3          -9.6
50                 8,178              49.8         56.3          -6.5
Σ or M            27,999              64.8         69.1          -4.2


The size of this effect, however, was small compared with both the confidence-frequency effect and that of selected versus representative sampling.

To summarize, neither warning of overconfidence nor an associated monetary incentive decreased overconfidence or increased percentage correct, replicating earlier findings that knowledge about overconfidence is not sufficient to change confidence. An incentive that subjects seem to have interpreted as rewarding those who maximize the percentage correct, however, increased confidence.

Order of presentation and sex. Which set (representative vs. selected) was given first had no effect on confidences, neither in Experiment 1 nor in Experiment 2. Arkes, Christensen, Lai, and Blumer (1987) found an effect of the difficulty of one group of items on the confidence judgments for a second when subjects received feedback for their performance in the first set. In our experiment, however, no feedback was given. Thus, subjects had no reason to correct their confidence judgments, such as by subtracting a constant value. Sex differences in degree of overconfidence in knowledge have been claimed by both philosophy and folklore. Our study, however, showed no significant differences between the sexes in either overconfidence or calibration, in either Experiment 1 or Experiment 2. (The men's confidence judgments were on the average 5% higher than women's, but so was their percentage correct. This replicates Lichtenstein and Fischhoff's (1981) findings about students at the University of Oregon.)

To summarize, as predicted by PMM theory, we can experimentally make overconfidence (overestimation) appear, disappear, and invert. Experiment 1 made our subjects consistently switch back and forth among these responses. The key to this finding is a pair of concepts that have been neglected by the main previous explanations of confidence in one's knowledge—confidence versus frequency judgment and representative versus selected sampling.

Experiment 2

We tried to replicate the facts and test several objections. First, to strengthen the case against this theory, we instructed the subjects both verbally and in written form that confidence is subjective probability, and that among all cases where a subjective probability of X% was chosen, X% of the answers should be correct. Several authors have argued that such a frequentist instruction could enhance external calibration or internal consistency (e.g., Kahneman & Tversky, 1982; May, 1987). According to PMM theory, however, confidence is already inferred from frequency (with or without this instruction)—but from frequencies of co-occurrences between, say, number of inhabitants and several cues, and not from base rates of correct answers in similar testing situations (see Table 1). Thus, in our view, the preceding caution will be ineffective because the base rate of correct answers is not a probability cue that is defined on a reference class such as cities in Germany.

Second, consider the confidence-frequency effect. We have shown that this new effect is implied by PMM theory. One objection might be that the difference between confidence and frequency judgments is an artifact of the response function, just as overconfidence has sometimes been thought to be. Consider the following interpretation of overconfidence. If (a) confidence is well calibrated but (b) the response function that transforms confidence into a confidence judgment differs from an identity function, then (c) overconfidence or underconfidence "occurs" on the response scale. Because an identity function has not been proven, Anderson (1986), for instance, denoted the overconfidence effect and the hard-easy effect as "largely meaningless" (p. 91): They might just as well be response function artifacts.

A similar objection could be made against the interpretation of the confidence-frequency effect within PMM theory. Despite the effect's stability across selected and representative sets, it may just reflect a systematic difference between response functions for confidence and frequency judgments. This conjecture can be rephrased as follows: If (a) the difference between "internal" confidence and frequency impression is zero, but (b) the response functions that transform both into judgments differ systematically, then (c) a confidence-frequency effect occurs on the response scales. We call this the response-function conjecture.

How can this conjecture be tested? According to PMM theory, the essential basis on which both confidence and frequency judgments are formed is the probability cues, not response functions. We assumed earlier that frequency judgments are based mainly on base rates of correct answers in a reference class of similar general-knowledge test situations. If we make another cue available, then frequency judgments should change. In particular, if we make the confidence judgments more easily retrievable from memory, these can be used as additional probability cues, and the confidence-frequency effect should decrease. This was done in Experiment 2 by introducing frequency judgments in the short run, that is, frequency judgments for a very small number of questions. Here, confidence judgments can be more easily retrieved from memory than they could in the long run. Thus, if PMM theory is correct, the confidence-frequency effect should decrease in the short run. If the issue were, however, different response functions, then the availability of confidence judgments should not matter because confidence and frequency impression are assumed to be identical in the first place. Thus, if the conjecture is correct, the confidence-frequency effect should be stable.

In Experiment 2, we varied the length N of a series of questions from the long-run condition N = 50 in Experiment 1 to the smallest possible short run of N = 2.

Third, in Experiment 1 we used a response scale ranging from 50% to 100% for confidence judgments but a full-range response scale for frequency judgments ranging from 0 to 50 correct answers (which corresponds to 0% to 100%). Therefore one could argue that the confidence-frequency effect is an artifact of the different ranges of the two response scales. Assume that (a) there is no difference between internal confidence and frequency, but (b) because confidence judgments are limited to the upper half of the response scale, whereas frequency judgments are not, (c) the confidence-frequency effect results as an artifact of the half-range response scale in confidence judgments. We refer to this as the response-range conjecture. It can be backed up by at least two hypotheses.

1. Assume that PMM theory is wrong and subjects indeed use base rates of correct answers as a probability cue for confidence in single answers.


Then confidence should be considerably lower. If subjects anticipate misleading questions, even confidences lower than 50% are reasonable to expect on this conjecture. Confidences below 50%, however, cannot be expressed on a scale with a lower boundary at 50%, whereas they can on the frequency scale. Effects of response range such as those postulated in range-frequency theory (Parducci, 1965) or by Schonemann (1983) may enforce the distorting effect of the half-range format. In this account, both the overconfidence effect and the confidence-frequency effect are generated by a response-scale effect. With respect to overconfidence, this conjecture has been made and has claimed some support (e.g., May, 1986, 1987; Ronis & Yates, 1987). We call this the base rate hypothesis.

2. Assume that PMM theory is wrong in postulating that choice and confidence are essentially one process and that the true process is a temporal sequence: choice, followed by search for evidence, followed by confidence judgment. Koriat et al. (1980), for instance, proposed this sequence. Assume further, contrary to Koriat, that the mind is "Popperian," searching for disconfirming rather than for confirming evidence to determine the degree of "corroboration" of an answer. If the subject is successful in retrieving disconfirming evidence from memory, but is not allowed to change the original answer, confidence judgments less than 50% will result. Such disconfirmation strategies, however, can hardly be detected using a 50%-100% format, whereas they could in a full-scale format. We call this the disconfirmation strategy hypothesis.

To test the response-range conjecture, half of the subjects in Experiment 2 were given full-range response scales, whereas the other half received the response scales used in Experiment 1.

Method

Subjects. Ninety-seven new subjects at the University of Konstanz (not enrolled in psychology) were paid for participation. There were 59 male and 38 female subjects. As in Experiment 1, subjects were tested in small groups of no more than 7 subjects.

Design and procedure. This was a 4 × 2 × 2 design, with length of series (50, 10, 5, and 2) and response scale (half range vs. full range) varied between subjects and type of knowledge questions (selected vs. representative set) varied within subjects.

The procedure and the materials were like those in Experiment 1, except for the following. We used a new random sample of 21 (instead of 25) cities. This change decreased the number of questions in the representative set from 300 to 210. As mentioned earlier, we explicitly instructed the subjects to interpret confidences as frequencies of correct answers: "We are interested in how well you can estimate subjective probabilities. This means, among all the answers where you give a subjective probability of X%, there should be X% of the answers correct." This calibration instruction was orally repeated and emphasized to the subjects.

The response scale contained the means (50%, 55%, 65%, . . . , 95%, 100%) of the intervals used in Experiment 1 rather than the intervals themselves to avoid the problematic assumption that means would represent intervals. Endpoints were marked absolutely certain that the alternative chosen is correct (100%), both alternatives equally probable (50%), and, for the full-range scale, absolutely certain that the alternative chosen is incorrect (0%). In the full-range scale, one reason for using confidences between 0% and 45% was explained in the following illustration: "If you think after you have made your choice that you would have better chosen the other alternative, do not change your choice, but answer with a probability smaller than 50%."

After each set of N = 50 (10, 5, or 2) answers, subjects gave a judgment of the number of correct answers. After having completed 50 + 210 = 260 confidence judgments and 5, 26, 52, or 130 frequency judgments (depending on the subject's group), subjects in both response-scale conditions were presented the same enlarged copy of the 50%-100% response scale and asked to estimate the relative frequency of correct answers in each confidence category.

Results

Response-range conjecture. We tested the conjecture that the systematic difference in confidence and frequency judgments stated in Predictions 1 and 2 (confidence-frequency effect) and shown in Experiment 1 resulted from the availability of only a limited response scale for confidence judgments (50% to 100%). Forty-seven subjects were given the full-range response scale for confidence judgments. Twenty-two of these never chose confidences below 50%; the others did. The number of confidence judgments below 50% was small. Eleven subjects used them only once (in altogether 260 judgments), 5 did so twice, and the others 3 to 7 times. There was one outlier, a subject who used them 67 times. In total, subjects gave a confidence judgment smaller than 50% for only 1.1% of their answers (excluding the outlier: 0.6%). If the response-range conjecture had been correct, subjects would have used confidence judgments below 50% much more frequently.

In the representative set, overconfidence was 3.7% (SEM = 1.23) in the full-range scale condition and 1.8% (SEM = 1.15) in the half-range condition. In the selected set, the corresponding values were 14.4 (SEM = 1.54) and 16.4 (SEM = 1.43). Averaging all questions, we got slightly larger overconfidence in the full-range condition (mean difference = 1.2). The response-range conjecture, however, predicted a strong effect in the opposite direction. Frequency judgments were essentially the same in both conditions. Hence, the confidence-frequency effect can also be demonstrated when both confidence and frequency judgments are made on a full-range response scale.

To summarize, there was (a) little use of confidences below 50% and (b) no decrease of overconfidence in the full-range condition. These results contradict the response-range conjecture.

A study by Ronis and Yates (1987) seems to be the only other study that has compared the full-range and the half-range format in two-alternative choice tasks, but it did not deal with frequency judgments. These authors also reported that only about half their subjects used confidence judgments below 50%, although they did so more frequently than our subjects. Ronis and Yates concluded that confidences below 50% had only a negligible effect on overconfidence and calibration (pp. 209-211). Thus, results in both studies are consistent. The main difference is that Ronis and Yates seem to consider only "failure to follow the instructions" and "misusing the probability scale" (p. 207) as possible explanations for confidence judgments below 50%. In contrast, we argue that there are indeed plausible cognitive mechanisms—the base rate and disconfirmation strategy hypotheses—that imply these kinds of judgments, although they would contradict PMM theory.


Both Experiment 2 and the Ronis and Yates (1987) study do not rule out, however, a more fundamental conjecture that is difficult to test. This argument is that internal confidence (not frequency) takes a verbal rather than a numerical form and that it is distorted on any numerical probability rating scale, not just on a 50%-100% response scale. Zimmer (1983, 1986) argued that verbal expressions of uncertainty (such as "highly improbable" and "very likely") are more realistic, more precise, and less prone to overconfidence and other so-called judgmental biases than are numerical judgments of probability. Zimmer's fuzzy-set modeling of verbal expressions, like models of probabilistic reasoning that dispense with the Kolmogoroff axioms (e.g., Cohen, 1989; Kyburg, 1983; Shafer, 1978), remains a largely unexplored source of alternative accounts of confidence.

For the remaining analysis, we do not distinguish between the full-range and the half-range response format. For combining the data, we recoded answers like "alternative a, 40% confident" as "alternative b, 60% confident," following Ronis and Yates (1987).
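This recoding rule is simple enough to state as a one-line transformation; the sketch below (Python; purely illustrative, not part of the original analysis) makes it explicit.

def recode_full_range(chosen, confidence, other):
    # "alternative a, 40% confident" becomes "alternative b, 60% confident"
    if confidence < 50:
        return other, 100 - confidence
    return chosen, confidence

print(recode_full_range("a", 40, "b"))  # ('b', 60)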

Predictions 1 and 2: Confidence-frequency effect. The question is whether the confidence-frequency effect can be replicated under the explicit instruction that subjective probabilities should be calibrated to frequencies of correct answers in the long run. Calibration curves in Experiment 2 were similar to those in Figure 6 and are not shown here for this reason. Figure 7 shows that the confidence-frequency effect replicates. In the selected set, mean confidence was 71.6%, and percentage correct was 56.2. Mean estimated number of correct answers (transformed into percentages) in the series of N = 50 was 52.0%. As stated in Prediction 1, overconfidence in single answers coexists with fairly accurate frequency judgments, which once again show slight underestimation.

In the representative set, mean confidence was 78.1% and percentage correct was 75.3%.4 Mean estimated number of correct answers per 50 answers was 63.5%. As forecasted in Prediction 2, overconfidence largely disappeared (2.8%), and frequency judgments showed underestimation (-11.8%).

An individual analysis produced similar results. The confidence-frequency effect (average confidence higher than average frequency judgment) held for 82 (83) subjects in the selected (representative) set (out of 97). Answering the selected set, 92 respondents showed overconfidence, and 5 showed underconfidence. In the representative set, however, 60 exhibited overconfidence and 37, underconfidence.

Prediction 3: Estimated percentage correct in confidence categories. After the subjects answered the 260 general-knowledge questions, they were asked what percentage they thought they had correct in each confidence category. As shown by the dashed line in Figure 8, results replicated well. Average estimated percentage correct differed again from confidence and was close to the actual percentage correct.

Despite the instruction not to do so, our subjects still distinguished between a specific confidence value and the corresponding percentage of correct responses. Therefore confidence and hypothesized percentage correct should not be used as synonyms (e.g., Dawes, 1980, pp. 331-345). As suggested by this experiment, an instruction alone cannot override the cognitive processes at work.

In the 100%-confidence category, for instance, 67 subjects gave estimates below 100%. In a postexperimental interview, we pointed out to them that these judgments imply that they assumed they had not followed the calibration instruction. Most subjects explained that in each single case, they were in fact 100% confident. But they also knew that, in the long run, some answers would nonetheless be wrong, and they did not know which ones. Thus, they did not know which of the 100% answers they should correct. When asked how they made the confidence judgments, most subjects answered by giving examples of probability cues, such as "I know that this city is located in the Ruhrgebiet, and most cities there are rather large." Interviews provided evidence for several probability cues, but no evidence that base rate expectations, as reported in frequency judgments, were also used in confidence judgments.

Response-function conjecture: frequency judgments in the short and long runs. We tested the conjecture that the confidence-frequency effect stated in Predictions 1 and 2 and shown in Experiment 1 might be due to different response functions for confidence and frequency judgments, rather than to different cognitive processes as postulated by PMM theory. If the conjecture were true, the availability of confidence judgments in the short run should not change the confidence-frequency effect (see the previous discussion).

Contrary to the response-function conjecture, the length of series showed a significant effect on the judgments of frequency of correct answers in each series (p = .025) as well as on the difference between judged and true frequency (p = .012). Figure 9 shows the extent of the disappearance of the confidence-frequency effect in the short run. The curve shows that the effect decreased from N = 50 to N = 2, averaged across both sets of items. The decrease was around 12%, an amount similar in the selected set (from 18.9% to 6.9%) and in the representative set (from 15.7% to 3.3%).

As would be expected from both the response-function conjecture and PMM theory, an analysis of variance over all 260 questions showed no significant effect of length of series (short vs. long runs) on either confidence judgment (p = .39) or number of correct answers (p = .40). (Similar results were obtained when the selected and representative sets were tested separately.)

The breakdown of the confidence-frequency effect in the short run is inconsistent with the objection that the effect can be reduced to a systematic difference in response functions. This result is, however, consistent with the notion that the shorter the run, the more easily are confidence judgments available from memory, and, thus, the more they can be used as probability cues for the true number of correct answers.
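The quantity plotted in Figure 9 is the difference between mean confidence and the frequency judgment expressed as a percentage of the series length. The following sketch shows how this measure could be computed for one subject (Python; the data layout is our own illustration, not the original analysis scripts):

def confidence_frequency_effect(blocks):
    # blocks: list of (confidences, estimated_correct) pairs, one per series;
    # confidences holds the N confidence judgments (in %) of that series,
    # estimated_correct is the judged number of correct answers in it.
    diffs = []
    for confidences, estimated_correct in blocks:
        n = len(confidences)
        mean_confidence = sum(confidences) / n
        estimated_percentage = 100.0 * estimated_correct / n
        diffs.append(mean_confidence - estimated_percentage)
    return sum(diffs) / len(diffs)

# hypothetical series of N = 2 with mean confidence 75% and 1 answer judged correct:
print(confidence_frequency_effect([([70, 80], 1)]))  # 25.0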

Figure 9. Decrease of the confidence-frequency effect in short runs (N = 50, 10, 5, and 2). (Values are differences between mean confidence and estimated percentage correct in a series of length N. Values are averaged across all questions.)

4 Confidence and percentage correct are averaged across all four conditions (series length) because these do not differ systematically among conditions. For comparison, the corresponding values for confidence and percentage correct in the N = 50 condition are 71.0 and 56.8 in the selected set and 79.2 and 74.5 in the representative set.

Discussion

Our starting point was the overconfidence effect, reported in the literature as a fairly stable cognitive illusion in evaluating one's general knowledge and attributed to general principles of memory search, such as confirmation bias (Koriat et al., 1980), to general motivational tendencies such as fear of invalidity (Mayseless & Kruglanski, 1987), to insensitivity to task difficulty (see von Winterfeldt & Edwards, 1986, p. 128), and to wishful thinking and other "deficits" in cognition, motivation, and personality. Our view, in contrast, proposes that one evaluates one's knowledge by probabilistic mental models. In our account, the main deficit of most cognitive and motivational explanations is that they neglect the structure of the task and its relation to the structure of a corresponding environment known to the subjects. If people want to search for confirming evidence or to believe that their answers are more correct than they are because of some need, wish, or fear, then overestimation of accuracy should express itself independently of whether they judge single answers or frequencies, a selected or representative sample of questions, and hard or easy questions.

Our experiments also do not support the explanation of overconfidence and the hard-easy effect by assuming that subjects are insensitive to task difficulty: In frequency tasks we have shown that subjects' judgments of their percentage correct in the long run are in fact close to actual percentage correct, although confidences are not. Overconfidence does not imply that subjects are not aware of task difficulty. At least two more studies have shown that estimated percentage correct can correspond closely to true percentage correct in general-knowledge tasks. Allwood and Montgomery (1987) asked their subjects to estimate how difficult each of 80 questions was for their peers and found that difficulty ratings (M = 57%) were more realistic (percentage correct = 61%) than confidence judgments (M = 74%). May (1987) asked her subjects to estimate their percentage of correct answers after they completed an experiment with two-alternative questions. She found that judgments of percentage correct accorded better with the true percentage correct than did confidences.

On our account, overconfidence results from one of two causes, or both: (a) A PMM for a task is not properly adapted to a corresponding environment (e.g., cue validities do not correspond to ecological validities), or (b) the set of objects used is not a representative sample from the corresponding reference class in the environment but is selected for difficulty. If a is the true cause, using a representative sample from a known environment should not eliminate overconfidence. If b is true, it should. In both experiments, overconfidence in knowledge about city populations was eliminated, as implied by b. Thus, experimental results are consistent with both PMM theory and the assumption that individual PMMs are on the average well adapted to the city environment we used.5 Overconfidence resulted in a set of questions that was selected for difficulty. Underconfidence, conversely, would result from questions selected to be easy.

The foregoing comments do not mean that overestimation of knowledge is just an artifact of selected questions. If it were, then judgments of frequency of correct answers should show a similar degree of overestimation. What we have called the confidence-frequency effect shows that this is not the case.

Several authors have proposed that judgments in the frequency mode are more accurate, realistic, or internally consistent than probabilities for single events (e.g., Teigen, 1974, p. 62; Tversky & Kahneman, 1983). Our account is different. PMM theory states conditions under which mean judgments of confidence are systematically larger than judgments of relative frequency. PMM theory does not, however, imply that frequency judgments are generally better calibrated. On the contrary, frequency judgments may be miscalibrated for the same reasons as confidence judgments. The set of tasks may not be representative for the reference class from which the inferences are made.

The experimental control of overestimation—how to make overestimation appear, disappear, and invert—gives support to PMM theory. These predictions, however, do not exhaust the inferences that can be derived from PMM theory.

Explaining Anomalies in the Literature

In this section, we explain a series of apparently inconsistent findings and integrate these into PMM theory.

Ronis and Yates (1987). We have mentioned that the Ronis and Yates (1987) study is the only other study that tested a full-range response scale for two-alternative tasks. The second purpose of that study was to compare confidence judgments in situations where the subject knows that the answers are known to the experimenter (general-knowledge questions) with outcomes of upcoming basketball games, where answers are not yet known. In all three (response-scale) groups, percentage correct was larger for general-knowledge questions than for basketball predictions. Given this result, what would current theories predict about overconfidence? The insensitivity hypothesis proposes that people are largely insensitive to percentage correct (see von Winterfeldt & Edwards, 1986, p. 128). This implies that overconfidence will be larger in the more difficult (hard) set: the hard-easy effect. (The confirmation bias and motivational explanations are largely mute on the difficulty issue.) PMM theory, in contrast, predicts that overconfidence will be larger in the easier set (hard-easy effect reversal, see Prediction 5) because general-knowledge questions (the easy set) were selected and basketball predictions were not; only with clairvoyance could one select these predictions for percentage correct.

5 After finishing this article, we learned about a study by Juslin (1991), in which random samples were drawn from several natural environments. Overall, overconfidence in general knowledge was close to zero, consistent with this study.

In fact, Ronis and Yates (1987) reported an apparent anomaly: three hard-easy effect reversals. In all groups, overconfidence was larger for the easy general-knowledge questions than for the hard basketball predictions (Figure 10). Ronis and Yates seem not to have found an explanation for these reversals of the hard-easy effect.

Koriat et al. (1980). Experiment 2 of Koriat et al.'s (1980) study provided a direct test of the confirmation bias explanation of overconfidence. The explanation is this: (a) Subjects first choose an answer based on their knowledge, then (b) they selectively search for confirming memory (or for evidence disconfirming the alternative not chosen), and (c) this confirming evidence generates overconfidence. Between the subjects' choice of an answer and their confidence judgment, the authors asked the subjects to give reasons for the alternative chosen. Three groups of subjects were asked to write down one confirming reason, one disconfirming reason, or one of each, respectively. Reasons were given for half of the general-knowledge questions; otherwise, no reasons were given (control condition). If the confirmation bias explanation is correct, then asking for a contradicting reason (or both reasons) should decrease overconfidence and improve calibration. Asking for a confirming reason, however, should make no difference "since those instructions roughly simulate what people normally do" (Koriat et al., 1980, p. 111).

What does PMM theory predict? According to PMM theory, choice and confidence are inferred from the same activated cue. This cue is by definition a confirming reason. Therefore, the confirming-reason and the no-reason (control) tasks engage the same cognitive processes. The difference is only that in the former the supporting reason is written down. Similarly, the disconfirming-reason and both-reason tasks involve the same cognitive processes. Furthermore, PMM theory implies that there is no difference between the two pairs of tasks.

Figure 10. Reversal of the hard-easy effect in Ronis and Yates (1987) and Keren (1988). (Axes: percentage correct vs. over-/underconfidence.)

This result is shown in Table 4. In the first row we have the no-reason and confirming-reason tasks, which are equivalent. Here, only one cue is activated, which is confirming. There is no disconfirming cue. Now consider the second row, the disconfirming-reason and both-reason tasks, which are again equivalent. Both tasks are solved if one additional cue, which is disconfirming, can be activated. Thus, for PMM theory, the cue generation and testing cycle is started again, and cues are generated according to the hierarchy of cue validities and tested whether they can be activated for the problem at hand. The point is that the next cue that can be activated may turn out to be either confirming or disconfirming.

For simplicity, assume that the probability that the next activated cue turns out to be confirming or disconfirming is the same. If it is disconfirming, the cycle is stopped, and two cues in total have been activated, one confirming and one disconfirming. This stopping happens with probability .5, and it decreases both confidence and overconfidence. (Because the second cue activated has a smaller cue validity, however, confidence is not decreased below 50%.) If the second cue activated is again confirming, a third has to be activated, and the cue generation and testing cycle is entered again. If the third cue is disconfirming, the cycle stops with two confirming cues and one disconfirming cue activated, as shown in the third row of Table 4. This stopping is to be expected with probability .25. Because the second cue has higher cue validity than the third, disconfirming, cue, overall an increase in confidence and overconfidence is to be expected. If the third cue is again confirming, the same procedure is repeated. Here and in all subsequent cases confidence will increase. As shown in Table 4, the probabilities of an increase sum up to .5 (.25 + .125 + .125), which is the same as the probability of a decrease.
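The stopping probabilities behind these figures follow directly from the assumption that each newly activated cue is confirming or disconfirming with probability .5; written out:

\[
P(\text{decrease}) = \tfrac{1}{2}, \qquad
P(\text{increase}) = \underbrace{\tfrac{1}{4}}_{3\ \text{cues}} + \underbrace{\tfrac{1}{8}}_{4\ \text{cues}} + \underbrace{\tfrac{1}{8}}_{>4\ \text{cues}} = \tfrac{1}{2},
\]

where the last term collects the remaining probability mass for all sequences with more than four activated cues.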

Thus, PMM theory leads to the prediction that, overall, asking for a disconfirming reason will not change confidence or overconfidence. As just shown, the confirmation-bias hypothesis, in contrast, predicts that asking for a disconfirming reason should decrease confidence and overconfidence.

What were the results of the Koriat study? In both crucial conditions, disconfirming reason and both reasons, the authors found only small and nonsignificant decreases of overconfidence (2% and 1%, respectively) and similar small improvements in calibration (.006 each, significant only in the disconfirming-reason task). These largely insignificant differences are consistent with the prediction by PMM theory that asking for a disconfirming reason makes no difference and are inconsistent with the confirmation-bias explanation. Further evidence comes from a replication of the Koriat study by Fischhoff and MacGregor (1982), who reported zero effects of disconfirming reasons.

To summarize, the effects on confidence of giving confirming and disconfirming reasons in the Koriat study can be both explained by and integrated into PMM theory. There is no need to postulate a confirmation bias.

Dawes (1980). Overconfidence has been attributed to people's tendency to "overestimate the power of our 'intellect' as opposed to that of our coding abilities." Such overestimation "has been reinforced by our realization that we have developed a technology capable of destroying ourselves" (Dawes, 1980, p. 328).


Table 4
Predictions of PMM Theory for the Effects of Asking for a Disconfirming Reason

Task         No. of cues    Cues activated:    Cues activated:    Probability    Predicted change
             activated      CON                DIS                               in confidence
No; CON          1               1                  0                  —               —
DIS; both        2               1                  1                 .5            Decrease
DIS; both        3               2                  1                 .25           Increase
DIS; both        4               3                  1                 .125          Increase
DIS; both       >4              >3                  1                 .125          Increase

Note. CON = confirming reason; DIS = disconfirming reason; No = no reason; both = both reasons.

Dawes (1980) proposed that overconfidence is characteristic of general-knowledge questions but absent in perceptual tasks; he designed a series of experiments to test this proposal. PMM theory, however, gives no special treatment to perceptual tasks. On the contrary, it predicts overconfidence if perceptual tasks are selected for perceptual illusions—that is, for being misleading—whereas zero overconfidence is to be expected if tasks are not selected. Pictures in textbooks on visual illusions are probably the set of items that produces the most extreme overconfidence yet demonstrated. Nevertheless, in a natural environment, perception is generally reliable.

Dawes reported inconsistent results. When perceptual stimuli were systematically constructed from a Square × Circle matrix, as in the area task, and no selection for stimuli that generated perceptual illusions took place, overconfidence was close to zero (perception of areas of squares is quite well adapted in adults; see Gigerenzer & Richter, 1990). This result is predicted by both accounts. The anomaly arises with the second perceptual task used—judging which of two subsequent tones is longer.6 If the second tone was longer, Dawes reported almost perfect calibration, but if the first tone was longer, subjects exhibited large overconfidence.

PMM theory predicts that in the inconsistent acoustic task, perceptual stimuli have been selected (albeit unwittingly) for a perceptual illusion. This is in fact the case. From the literature on time perception, we know that of two subsequently presented tones, the tone more recently heard appears to be longer. This perceptual illusion is known as the negative presentation effect (e.g., Fraisse, 1964; Sivyer & Finlay, 1982). It implies a smaller percentage of correct answers in the condition where the tone presented first was longer, because this tone is perceived to be shorter. A decrease in percentage correct in turn increases overconfidence. In Dawes's (1980) experiments, this is exactly the inconsistent condition where overconfidence occurred. Thus, from the perspective we propose, this inconsistent result can be reconciled.

Keren (1988). A strict distinction between perceptual judgment and intellectual judgment cannot be derived from many views of perception, such as signal-detection theory (Tanner & Swets, 1954). Reflecting on this fact, Keren (1988) proposed a slightly modified hypothesis: The more perceptionlike a task is, the less overconfident and the better calibrated subjects will be. "As a task requires additional higher processing and transformation of the original sensory input, different kinds of possible cognitive distortions may exist (such as inappropriate inferences) that may limit the ability to accurately monitor our higher cognitive processes" (Keren, 1988, p. 99).

Keren (1988, Experiment 1) used general-knowledge questions and two kinds of perceptual tasks, one of them more difficult than the general-knowledge task, the other less difficult. Keren tested the hypothesis that confidence judgments in perceptual tasks are better calibrated than in general-knowledge tasks. He could not support it, however. Instead, he found an anomaly: The comparison between the general-knowledge task and the more difficult perceptual task reversed the hard-easy effect (see Figure 10). As derived in Prediction 5, this puzzling reversal is implied by PMM theory if the Landolt rings used in the more difficult perceptual task were not selected for perceptual illusions, as seems to be the case (Keren, 1988, p. 100).

Note that the kind of general-knowledge questions used (population of cities or countries, and distances between cities) would easily permit defining a reference class in a known environment and obtaining representative samples. But no representative sample of general-knowledge questions was generated. This lack makes the other predictions from PMM theory coincide with Keren's (1988): overconfidence in general knowledge, and zero overconfidence in the two perceptual tasks. Results show this outcome, except for the large-gap Landolt rings condition, which generated considerable underconfidence. PMM theory cannot account for the latter, nor can the notion of degree of perception-likeness.

A second perceptual task was letter identification. In Experiment 3, Keren (1988) used two letter-identification tasks, which were identical except that the exposure time of the letters to be recognized was either short or long. Mean percentages correct were 63.5 for short and 77.2 for long exposures. According to earlier explanations such as subjects' insensitivity to task difficulty, a hard-easy effect should result. According to PMM theory, however, the hard-easy effect should be zero, because both tasks were generated by the same sampling process (Prediction 4). In fact, Keren (1988, p. 112) reported that in both tasks, overconfidence was not significantly different from zero. He seems to have found no explanation for this disappearance of the hard-easy effect in a situation where differences in percentage correct were large.7

6 Dawes's eye-color task is not dealt with here because it is a memory task, not a perceptual task.

Mental Models as Probabilistic Syllogisms

To the best of our knowledge, there is only one other mental models approach to confidence. May (1986, 1987) emphasized the role of mental models to understand the mechanism of confidence judgments and the role of misleading questions to cause overconfidence. Consider again the following question: "Which city has more inhabitants? (a) Hyderabad or (b) Islamabad." May proposed that subjects answer this question by constructing a mental model that can be expressed as a "probabilistic syllogistic inference" (May, 1986, p. 21):

"Most capitals have quite a lot of inhabitants.Islamabad is a capital.

Presumably, Islamabad has quite a lot of inhabitants."

Replacing "most" by the probability p(large|capital) = 1 -alpha, that is, the probability that a city has a large population ifit is a capital, she made the following argument:

"The subjective probability therefore has to be '1 - alpha.' Giventhat the perception of that percentage is correct, the observedfrequency of correct answers will be 1 - alpha. Even if there wasrandom fluctuation of subjective probability and task difficulty, inthe long run calibration is expected" (May, 1986, p. 20).

May (1986, 1987) did highly interesting analyses of individual general-knowledge questions—among others, direct tests of the use of the capital cue and the familiarity cue in questions of the Hyderabad-Islamabad type. Both May and PMM theory share an emphasis on the mental models subjects use to construe their tasks. We discuss here two issues where we believe that May's position could be strengthened: the kind of mental model she proposed and the role of misleading questions.

First, we show that the probabilistic syllogism does not work in the sense she specified, that is, generating long-run calibration. We propose a working version. The general reason why the syllogism mental model does not work in a two-alternative task is that it does not deal with information about the alternative, that is, whether Hyderabad is known as a capital or noncapital or whether its status is unknown. Specifically, assume that the subject knows nothing about Hyderabad, as May did (1986, p. 19). The syllogism produces the choice a and the confidence judgment p(a = large | a = capital), where a stands for Islamabad. However, this confidence judgment is not equal to the long-run frequency of correct answers. The long-run frequency of correct answers, p(a larger than b | a = capital), depends on both p(a = large | a = capital) and p(b = large | b = city), where b stands for Hyderabad and "city" means no knowledge of whether b is a capital or a noncapital. For instance, the larger p(b = large | b = city), the smaller the long-run frequency of correct answers.

A mental model that generates confidence judgments that are well calibrated with long-run frequencies of correct answers can be constructed in at least two ways. First, the probabilistic syllogism can be supplanted by a second syllogism that uses our knowledge about the alternative—capital, noncapital, or just city (no knowledge of whether it is a capital or noncapital). Here is a numerical illustration of what can be called the double-syllogism model:

80% of capitals have a large population.
Islamabad is a capital.
The probability is .80 that Islamabad has a large population.

40% of cities have a large population.
Hyderabad is a city.
The probability is .40 that Hyderabad has a large population.

What is the long-run frequency of correct answers? There are four possible classes of events: a = large and b = not large; b = large and a = not large; a = large and b = large; and a = not large and b = not large. Given the choice a, the first class of events signifies correct choices; the second, incorrect choices; and the third and fourth do not contain discriminating information. The long-run frequency of correct answers consists of the first class of events and of half of the third and fourth—on the assumption that half of the nondiscriminating cases will be correct and the other half incorrect. Thus, the long-run frequency of correct answers is p(a = large, b = not large | a = capital) + ½(p[a = large, b = large | a = capital] + p[a = not large, b = not large | a = capital]). We denote the probabilities from the first and the second syllogism as α and β, respectively. Then, the long-run frequency of correct answers is α(1 − β) + ½[αβ + (1 − α)(1 − β)], which is ½(α − β + 1). For instance, if α = β, this probability is .50. Therefore, in the above double-syllogism model, the (calibrated) confidence that a is correct is ½(α − β + 1). May (1986), in contrast, proposed α. For the double syllogism, we get ½(.80 − .40 + 1) = .70.
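The algebraic step from the event decomposition to this compact expression, spelled out:

\[
\alpha(1-\beta) + \tfrac{1}{2}\bigl[\alpha\beta + (1-\alpha)(1-\beta)\bigr]
= \alpha - \alpha\beta + \tfrac{1}{2}\bigl(1 - \alpha - \beta + 2\alpha\beta\bigr)
= \tfrac{1}{2}(\alpha - \beta + 1).
\]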

A second solution would be to dispense with the dichotomy of large versus small population and to use the cue validities as defined in PMM theory. Both changes would make May's mental models work and would lead to a mechanism that differs in an interesting way from that illustrated in Figure 3. In contrast to what we have proposed, May (1986) assumed that cues are activated even if knowledge from memory can be retrieved only for one alternative.

7 Keren (1987, 1988; Wagenaar & Keren, 1985) also distinguished tasks in which the items are related (e.g., repeated weather forecasting) versus unrelated (e.g., typical general-knowledge questions). A similar distinction was made by Ronis and Yates (1987). PMM theory can connect Keren's distinction between two kinds of tasks with a model of cognitive processes involved in different tasks. In a set of unrelated items, a new PMM has to be constructed for each new item that cannot be answered by a local MM. This new PMM includes a new reference class, new target variable, and new cues and cue validities. This holds for reasoning about a set of typical general-knowledge questions. In contrast, the representative set of city questions used in our experiments implies that the PMMs for subsequent items include the same reference class, same target value, and same hierarchy of cues and cue validities but that different cues will be activated in different questions. Thus, in this framework, the distinction between related and unrelated items is neither a dichotomy nor a single continuum, but multidimensional. In general, a set of items can cue a series of PMMs that have (a) the same-different reference class, (b) the same-different target variable, (c) the same-different set of cues and cue validities, and (d) the same-different activated cues. Thus, at the other extreme of the typical general-knowledge task, there is a series of tasks that implies the construction of a succession of PMMs that are identical with respect to all four dimensions. An example is the repeated judgment of the frequency of correct answers in our experiments.

May (1986, 1987) also proposed that overconfidence is due to misleading items. The Islamabad-Hyderabad question is one example of a misleading item with less than 50% correct answers. Most subjects chose Islamabad, whereas Hyderabad has a much larger population. An extreme example is "Three fourths of the world's cacao comes from (a) Africa or (b) South America," for which Fischhoff et al. (1977) reported only 4.8% correct answers. Fischhoff et al. showed that extreme overconfidence, such as odds greater than 50:1, is prevalent in what they called deceptive items (more than 73%) but still exists in nondeceptive items (less than 9%). Because of the latter finding, among other grounds, they concluded that misleading items "are not responsible for the extreme overconfidence effect" (p. 561). In contrast, May (1986, 1987) seems to have held that if misleading items were eliminated, overconfidence would be, too. But she also emphasized, as we do, the role of representative sampling.

We propose that the issue of misleading questions can be fully reduced to the issue of representative sampling from a reference class, which provides a deeper understanding of confidence in knowledge. As we have pointed out before, the same set of general-knowledge questions can be nonrepresentative with respect to one reference class (e.g., all cities in Germany), but representative with respect to a different reference class (e.g., sets of typical general-knowledge questions). The notion of misleading items does not capture this distinction, which is essential both to PMM theory and to an explanation of when and why overconfident subjects can quite realistically estimate their true relative frequencies of correct responses.

The Brunswikian Perspective

PMM theory draws heavily on the Brunswikian notions of a natural environment known to an individual, reference classes in this environment, and representative sampling from a reference class. We went beyond the Brunswikian focus on achievement (rather than process) by providing a theoretical framework of the processes that determine choice, confidence, and frequency judgment.

Choice and confidence are a result of a cue-testing and activation cycle, which is analogous to Newell and Simon's (1972) postulate that "problem solving takes place by search in a problem space" (p. 809). Furthermore, the emphasis on the structure of the task in PMM theory is similar to Newell and Simon's proposition that "the structure of the task determines the possible structures of the problem space" (p. 789). Unlike PMM theorists, however, Newell and Simon also assumed in the tasks they studied (cryptarithmetic, logic, and chess) a relatively simple mapping between the external structure of the task and the internal representation in a problem space (see Allport, 1975).

Although it is cued by the task structure, we assume that a

PMM (the functional equivalent of a problem space) has a large

surplus structure (the reference class and the cues), which is

taken from a known structure in the problem solver's natural

environment. The emphasis on the structure of everyday knowl-

edge or environment (as distinguished from the task environ-

ment) has been most forcefully defended by Brunswik (see Gi-

gerenzer, 1987). Although Newell and Simon (1972, p. 874)

called Brunswik and Tolman "the real forerunners" of their

work, they seem not to distinguish clearly between the notions

of a probabilistic everyday environment and a task environ-

ment. This theory is an attempt to combine both views. Bruns-

wik's focus on achievement (during his behavioristic phase; see

Leary, 1987) corresponds more closely to the part of research on

probabilistic judgment that focuses on calibration, rather than

on the underlying cognitive processes.

The importance of the cognitive representation of the task

was studied by the Würzburg school and emphasized in Gestalt theoretical accounts of thinking (e.g., Duncker, 1935/1945), and this issue has recently regained favor (e.g., Brehmer, 1988; Hammond, Stewart, Brehmer, & Steinmann, 1975). In their review, Einhorn and Hogarth (1981) emphasized that "the cognitive

approach has been concerned primarily with how tasks are rep-

resented. The issue of why tasks are represented in particular

ways has not yet been addressed" (p. 57). PMM theory ad-

dresses this issue. Different tasks, such as confidence and fre-

quency tasks, cue different reference classes and different proba-

bility cues from known environments. It is these environments

that provide the particular representation, the PMM, of a task.

Many parts of PMM theory need further expansion, develop-

ment, and testing. Open issues include the following: (a) What

reference class is activated? For city comparisons, this question

has a relatively clear answer, but in general, more than one

reference class can be constructed to solve a problem. (b) Are

cues always generated according to their rank in the cue validity

hierarchy? Alternative models of cue generation could relax this

strong assumption, assuming, for instance, that the first cue

generated is the cue activated in the last problem. The latter

would, however, decrease the percentage of correct answers. (c)

What are the conditions under which we may expect PMMs to

be well adapted? There exists a large body of neo-Brunswikian

research that, in general, indicates good adaptation but also

points out exceptions (e.g., K. Armelius, 1979; Brehmer & Joyce, 1988; Björkman, 1987; Hammond & Wascoe, 1980). (d) What

are the conditions under which cue substitution without cue

integration is superior to multiple cue integration? PMM theory

assumes a pure cue substitution model—a cue that cannot be

activated can be replaced by any other cue—without integra-

tion of two or more cues. We focused on the substitution and

not the integration aspect of Brunswik's vicarious functioning

(see Gigerenzer & Murray, 1987, pp. 66-81), in contrast to the

multiple regression metaphor of judgment. Despite its simplic-

ity, the substitution model produces zero overconfidence and a

large number of correct answers, if the PMM is well adapted.

There may be further reasons to prefer simple substitution models.

B. Armelius and Armelius (1974), for instance, reported that

subjects were well able to use ecological validities, but not the

correlations between cues. If the latter is the case, then multiple

cue integration may not work well.
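
As an illustration, the cue-testing cycle and the pure substitution assumption can be written out in a few lines. The following is a minimal sketch, not part of the original exposition: the city labels, cue names, and cue validities are hypothetical, and it adopts the simplifying assumption that the reported confidence equals the validity of the activated cue.

    # Minimal sketch of a pure cue-substitution cycle (illustrative only).
    # Cues are tested in descending order of (hypothetical) cue validity; the
    # first cue that can be activated, i.e., that discriminates between the
    # two alternatives, determines the choice, and its validity is taken as
    # the confidence. No integration of two or more cues takes place.

    CUE_VALIDITIES = {"soccer_team": 0.90, "state_capital": 0.80, "industry": 0.70}

    KNOWLEDGE = {
        # cue values: True, False, or None (not retrievable from memory)
        "city_a": {"soccer_team": None,  "state_capital": True, "industry": True},
        "city_b": {"soccer_team": False, "state_capital": None, "industry": False},
    }

    def choose_with_confidence(a, b, knowledge, validities):
        """Return (choice, confidence) from the first activatable cue, else guess."""
        for cue in sorted(validities, key=validities.get, reverse=True):
            va, vb = knowledge[a][cue], knowledge[b][cue]
            if va is not None and vb is not None and va != vb:
                return (a if va else b), validities[cue]
        return a, 0.5  # no cue could be activated: guess with confidence 50%

    # The two most valid cues cannot be activated here, so the third
    # substitutes: city_a is chosen with confidence .70.
    print(choose_with_confidence("city_a", "city_b", KNOWLEDGE, CUE_VALIDITIES))
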

We briefly indicate here that features emphasized by PMM

theory, such as representative sampling and the confidence-fre-

quency distinction, can also be crucial for probabilistic reason-

ing in other tasks.

In several Bayesian-type studies of revision of belief, repre-



sentative (random) sampling from a reference class is a crucial

issue. For instance, Gigerenzer, Hell, and Blank (1988) showed

that subjects' neglect of base rates in Kahneman and Tversky's

(1973) engineer-lawyer problem disappeared if subjects could

randomly draw the descriptions from an urn. Similar results

showing people's sensitivity to the issue of representative versus

selected sampling have been reported by Cosmides and Tooby

(1990), Ginossar and Trope (1987), Grether (1980), Hansen and

Donoghue (1977), and Wells and Harvey (1977), but see Nisbett

and Borgida (1975).

This study has also demonstrated that judgments of single

events can systematically differ from judgments of relative fre-

quencies. Similar differences were found for other kinds of prob-

abilistic reasoning (Gigerenzer, 1991a, 1991b). For instance, the

"conjunction fallacy" has been established by asking subjects

the probabilities of single events, such as whether "Linda" is

more likely to be (a) a bank teller or (b) a bank teller and active

in the feminist movement. Most subjects chose the latter, be-

cause the description of Linda was constructed to be representa-

tive of an active feminist. This judgment was called a conjunc-

tion fallacy because the probability of a conjunction (bank

teller and feminist) is never larger than the probability of one of

its constituents. As in the engineer-lawyer problem, the repre-

sentativeness heuristic was proposed to explain the "fallacy."

Fiedler (1988) and Tversky and Kahneman (1983), however,

showed that the conjunction fallacy largely disappeared if peo-

ple were asked for frequencies (e.g., "There are 100 persons like

Linda. How many of them are. . . ?") rather than probabilities

of single events. Cosmides and Tooby (1990) showed a similar

striking difference for people's reasoning in a medical probabil-

ity revision problem. The subjects' task was to estimate the

probability that people have a disease, given a positive test re-

sult, the base rate of the disease, the false-alarm rate, and the hit

rate of the test. Originally, Casscells, Schoenberger, and Gray-

boys (1978) reported only 18% Bayesian answers when Harvard

medical students and staff were asked for a single-event proba-

bility (What is the probability that a person found to have a

positive result actually has the disease?). When Cosmides and

Tooby changed the task into a frequency task (How many peo-

ple who test positive will actually have the disease?), 76% of

subjects responded with the Bayesian answer. These results sug-

gest that the mental models subjects construct to solve these

reasoning problems were highly responsive to information cru-

cial for probability and statistics—random versus selected sam-

pling and single events versus frequencies in the long run.
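
The normative calculation behind such a diagnosis problem is short enough to write out. The following sketch computes the Bayesian answer both as a single-event probability and in a frequency format; the specific numbers (a base rate of 1 in 1,000, a hit rate of 100%, and a false-alarm rate of 5%) are illustrative assumptions in the spirit of the original problem, not values reported in this article.

    # Bayesian posterior for "disease given positive test", computed two ways.
    # Illustrative parameter values (assumed, not quoted from the article):
    base_rate = 0.001     # 1 in 1,000 people has the disease
    hit_rate = 1.0        # P(positive | disease)
    false_alarm = 0.05    # P(positive | no disease)

    # Single-event (probability) format: Bayes' theorem.
    p_positive = hit_rate * base_rate + false_alarm * (1 - base_rate)
    posterior = hit_rate * base_rate / p_positive
    print(f"P(disease | positive) = {posterior:.3f}")   # about 0.02

    # Frequency format: imagine 1,000 people.
    n = 1000
    sick = base_rate * n                        # 1 person with the disease
    true_positives = hit_rate * sick            # 1 positive among the sick
    false_positives = false_alarm * (n - sick)  # about 50 positives among the healthy
    print(f"{true_positives:.0f} of {true_positives + false_positives:.0f} "
          f"people who test positive actually have the disease")
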

Is Overconfidence a Bias According to Probability Theory?

Throughout this article, we have avoided classifying judg-

ments as either rational or biased, but instead focused on the

underlying cognitive processes and how these explain extant

data. Overconfidence is, however, usually classified as a bias

and dealt with in chapters on "cognitive illusions" (e.g., Edwards & von Winterfeldt, 1986). Is overconfidence a bias according to

probability theory?

Mathematical probability emerged around 1660 as a Janus-

faced concept with three interpretations: observed frequencies

of events, equal possibilities based on physical symmetry, and

degrees of subjective certainty or belief. Frequencies originally

came from mortality and natality data, sets of equiprobable

outcomes from gambling, and the epistemic sense of belief pro-

portioned to evidence from courtroom practices (Daston,

1988). Eighteenth-century mathematicians used "probability"

in all three senses, whereas latter-day probabilists drew a bold

line between the first two "objective" senses and the third "subjec-

tive" one. Today, mathematicians, statisticians, and philoso-

phers are still wrangling over the proper interpretation of proba-

bility: Does it mean a relative frequency, a propensity, a degree

of belief, a degree of evidentiary confirmation, or yet some-

thing else? Prominent thinkers can still be found in every camp,

and it would be bold unto foolhardy to claim that any interpre-

tation had a monopoly on reasonableness (Gigerenzer et al.,

1989).

Overconfidence is defined as the difference between degrees

of belief (subjective probabilities) and a relative frequency (per-

centage correct). Is a deviation between the probability that a

particular answer is correct and the relative frequency of

correct answers a bias or error, according to probability theory?

From the point of view of dedicated frequentists such as von

Mises (1928/1957) and Neyman (1977), it is not. According to

the frequentist interpretation (which is the dominant interpre-

tation in statistics departments today), probability theory is

about relative frequencies in the long run; it does not deal with

degrees of belief concerning single events. For instance, when

speaking of "the probability of death":

[One] must not think of an individual, but of a certain class as a whole, e.g., "all insured men forty-one years old living in a given country and not engaged in certain dangerous occupations." . . . The phrase "probability of death," when it refers to a single person, has no meaning at all for us. (von Mises, 1928/1957, p. 11)

For a frequentist, one cannot properly speak of a probability

until a reference class has been defined. The statistician Bar-

nard (1979), for instance, suggested that if one is concerned

with the subjective probabilities of single events, such as confi-

dence, one "should concentrate on the works of Freud and per-

haps Jung rather than Fisher and Neyman" (p. 171). Thus, for

frequentists, probability theory does not apply to single-event

judgments like confidences, and therefore no statement about

confidences can violate probability theory.
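
The contrast at issue here can be made concrete with a small numerical sketch (our illustration, with hypothetical data): overconfidence compares a mean of single-event degrees of belief with a relative frequency, whereas a judgment of frequency can be compared directly with the actual frequency of correct answers.

    # Hypothetical responses: for each item, the confidence given (0.5-1.0)
    # and whether the chosen answer was in fact correct.
    confidences = [0.9, 0.8, 1.0, 0.7, 0.95, 0.6]
    correct =     [True, False, True, True, False, True]

    mean_confidence = sum(confidences) / len(confidences)
    proportion_correct = sum(correct) / len(correct)

    # Overconfidence: degrees of belief in single events vs. a relative frequency.
    overconfidence = mean_confidence - proportion_correct
    print(f"mean confidence {mean_confidence:.2f}, "
          f"proportion correct {proportion_correct:.2f}, "
          f"overconfidence {overconfidence:+.2f}")

    # Frequency judgment: an estimated number of correct answers compared
    # with the true number, a frequency-frequency comparison.
    estimated_correct = 4   # hypothetical judgment: "about 4 of the 6 were correct"
    print(f"estimated {estimated_correct} correct, actually {sum(correct)} correct")
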

Moreover, even subjectivists would not generally think of a

deviation between probabilities for single events and relative

frequencies as a bias. The problem is whether and when a subjec-

tivist, who rejects the identification of probability with objec-

tive frequency, should nonetheless make frequency the yard-

stick of good reasoning (for a discussion of conditions, see Ka-

dane & Lichtenstein, 1982). The subjectivist Bruno de Finetti,

for instance, emphatically stated in his early work that subjec-

tive probabilities of single events cannot be validated by objec-

tive probabilities:

However an individual evaluates the probability of a particular event, no experience can prove him right, or wrong; nor, in general, could any conceivable criterion give any objective sense to the distinction one would like to draw, here, between right and wrong. (de Finetti, 1931/1989, p. 174)

We thus have to face a problem: Many cognitive psychologists

think of overconfidence as a bias of reasoning, pointing to prob-



ability theory as justification. Many probabilists and statisti-

cians, however, would reply that their interpretation of probabil-

ity does not justify this label (see Hacking, 1965; Lad, 1984;

Stegmüller, 1973).

PMM theory can offer a partial solution to this problem.

First, it clarifies the distinction between confidence and fre-

quency judgment and therewith directs attention to the compar-

ison between estimated and true frequencies of correct an-

swers. The latter avoids the previously stated problem. This

comparison has not received much attention in research on

confidence in knowledge. Second, PMM theory proposes a fre-

quentist interpretation of degrees of belief: Both confidence

and frequency judgments are based on memory about frequen-

cies. Our view links both types of judgment but does not equate

them. Rather, it specifies when to expect confidence and fre-

quency judgments to diverge, and in what direction, and when

they will converge. PMM theory integrates single-event proba-

bilities into a frequentist framework: the Bayesian is Bruns-

wikian.

Conclusions

We conjecture that confidence in one's knowledge of the kind

studied here—immediate and spontaneous rather than a prod-

uct of long-term reflection—is largely determined by the struc-

ture of the task and the structure of a corresponding, known

environment in a person's long-term memory. We provided ex-

perimental evidence for this hypothesis by showing how

changes in the task (confidence vs. frequency judgment) and in

the relationship between task and environment (selected vs. rep-

resentative sampling) can make the two stable effects reported

in the literature—overconfidence and the hard-easy effect—

emerge, disappear, and invert at will. We have demonstrated a

new phenomenon, the confidence-frequency effect. One can-

not speak of a general overconfidence bias anymore, in the

sense that it relates to deficient processes of cognition or moti-

vation. In contrast, subjects seem to be able to make fine con-

ceptual distinctions—confidence versus frequency—of the

same kind as probabilists and statisticians do. Earlier attempts

postulating general deficiencies in information processing or

motivation cannot account for the experimental results pre-

dicted by PMM theory and confirmed in two experiments.

PMM theory seems to be the first theory in this field that gives

a coherent account of these various effects by focusing on the

relation between the structure of the task, the structure of a

corresponding environment, and a PMM.

References

Allport, D. A. (1975). The state of cognitive psychology. Quarterly Jour-

nal of Experimental Psychology, 27, 141-152.

Allwood, C. M., & Montgomery, H. (1987). Response selection strate-

gies and realism of confidence judgments. Organizational Behavior

and Human Decision Processes, 39, 365-383.

Anderson, N. H. (1986). A cognitive theory of judgment and decision. In B. Brehmer, H. Jungermann, P. Lourens, & G. Sevón (Eds.), New directions in research on decision making (pp. 63-108). Amsterdam: North-Holland.
Angele, U., Beer-Binder, B., Berger, R., Bussmann, C., Kleinbölting, H., & Mansard, B. (1982). Über- und Unterschätzung des eigenen Wissens in Abhängigkeit von Geschlecht und Bildungsstand [Overestimation and underestimation of one's knowledge as a function of sex and education]. Unpublished manuscript, University of Konstanz, Constance, Federal Republic of Germany.

Arkes, H. R., Christensen, C., Lai, C., & Blumer, C. (1987). Two methods of reducing overconfidence. Organizational Behavior and Human Decision Processes, 39, 133-144.
Arkes, H. R., & Hammond, K. R. (Eds.). (1986). Judgment and decision making: An interdisciplinary reader. Cambridge, England: Cambridge University Press.
Armelius, B., & Armelius, K. (1974). The use of redundancy in multiple-cue judgments: Data from a suppressor-variable task. American Journal of Psychology, 87, 385-392.
Armelius, K. (1979). Task predictability and performance as determinants of confidence in multiple-cue judgments. Scandinavian Journal of Psychology, 20, 19-25.

Barnard, G. A. (1979). Discussion of the paper by Professors Lindley and Tversky and Dr. Brown. Journal of the Royal Statistical Society of London, 142 (Series A), 171-172.
Björkman, M. (1987). A note on cue probability learning: What conditioning data reveal about cue contrast. Scandinavian Journal of Psychology, 28, 226-232.

Brehmer, B. (1988). The development of social judgment theory. In B.

Brehmer & C. R. B. Joyce (Eds.), Human judgment: The SJT view

(pp. 13-40). Amsterdam: North-Holland.

Brehmer, B., & Joyce, C. R. B. (Eds.). (1988). Human judgment: TheSJT view. Amsterdam: North-Holland.

Brunswik, E. (1943). Organismic achievement and environmental

probability. Psychological Review, 50, 255-272.

Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193-217.

Brunswik, E. (1964). Scope and aspects of the cognitive problem. In

Contemporary approaches to cognition (pp. 4-31). Cambridge, MA:

Harvard University Press.

Casscells, W., Schoenberger, A., & Grayboys, T. (1978). Interpretation

by physicians of clinical laboratory results. New England Journal of

Medicine, 299, 999-1000.

Cohen, L. J. (1989). The philosophy of induction and probability. Oxford,

England: Clarendon Press.

Cosmides, L, & Tooby, J. (1990, August). Is the mind a frequentist?Paper presented at the 31st Annual Meeting of the Psychonomics

Society, New Orleans, LA.

Daston, L. J. (1988). Classical probability in the Enlightenment. Prince-

ton, NJ: Princeton University Press.

Dawes, R. M. (1980). Confidence in intellectual judgments vs. confi-

dence in perceptual judgments. In E. D. Lantermann & H. Feger

(Eds.), Similarity and choice: Papers in honor of Clyde Coombs (pp.

327-345). Bern, Switzerland: Huber.

de Finetti, B. (1989). Probabilism. Erkenntnis, 31, 169-223. (Original

work published 1931)

Duncker, K. (1945). On problem solving (L. S. Lees, Trans.). Psychologi-

cal Monographs, 58(5, Whole No. 270). (Original work published 1935)

Edwards, W, & von Winterfeldt, D. (1986). On cognitive illusions andtheir implications. In H. R. Arkes & K. R. Hammond (Eds.), Judg-

ment and decision making: An interdisciplinary reader (pp. 642-679).

Cambridge, England: Cambridge University Press.

Einhorn, H. J., & Hogarth, R. M. (1981). Behavioral decision theory:

Processes of judgment and choice. Annual Review of Psychology, 32,

53-88.

Fiedler, K. (1988). The dependence of the conjunction fallacy on subtle

linguistic factors. Psychological Research, 50, 123-129.

Fischhoff, B. (1982). Debiasing. In D. Kahneman, P. Slovic, & A.



Tversky (Eds.), Judgment under uncertainty: Heuristics and biases

(pp. 422-444). Cambridge, England: Cambridge University Press.

Fischhoff, B., & MacGregor, n (1982). Subjective confidence in fore-

casts. Journal of forecasting, 1,155-172.

Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with cer-

tainty: The appropriateness of extreme confidence. Journaloj'Exper-

imental Psychology: Human Perception and Performance, 3, 552-

564.

Fraisse, P. (1964). The psychology of time. London: Eyre & Spottis-

woode.

Gigerenzer, G. (1984). External validity of laboratory experiments:

The frequency-validity relationship. American Journal of Psychol-

ogy, 97,185-195.

Gigerenzer, G. (1987). Survival of the fittest probabilist: Brunswik,

Thurstone, and the two disciplines of psychology. In L. Krüger, G.

Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution, Vol.

2: Ideas in the sciences. Cambridge, MA: MIT Press.

Gigerenzer, G. (1991a). From tools to theories: A heuristic of discovery

in cognitive psychology. Psychological Review, 98, 254-267.

Gigerenzer, G. (1991b). How to make cognitive illusions disappear:

Beyond "heuristics and biases". European Review of Social Psychol-

ogy, 2, 83-115.

Gigerenzer, G., Hell, W., & Blank, H. (1988). Presentation and content: The use of base rates as a continuous variable. Journal of Experimental Psychology: Human Perception and Performance, 14, 513-525.
Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.

Gigerenzer, G, & Richter, H. R. (1990). Context effects and their inter-

action with development: Area judgments. Cognitive Development,

5, 235-264.

Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L. J., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed

science and everyday life. Cambridge, England: Cambridge Univer-

sity Press.

Ginossar, Z., & Trope, Y. (1987). Problem solving in judgment under

uncertainty. Journal of Personality and Social Psychology, 52, 464-

474.

Grether, D. M. (1980). Bayes rule as a descriptive model: The represen-tativeness heuristic. The Quarterly Journal of Economics, 95, 537-

557.

Hacking, I. (1965). Logic of statistical inference. Cambridge, England:

Cambridge University Press.

Hammond, K. R., Stewart, T. R., Brehmer, B., & Steinmann, D. O.

(1975). Social judgment theory. In M. F. Kaplan & S. Schwartz (Eds.),

Human judgment and decision processes. San Diego, CA: Academic

Press.

Hammond, K. R., & Wascoe, N. E. (Eds.) (1980). Realizations of

Brunswik's representative design. New Directions for the Methodol-ogy of Social and Behavioral Science, 3, 271-312.

Hansen, R. D., & Donoghue, J. M. (1977). The power of consensus:

Information derived from one's own and others' behavior. Journal of

Personality and Social Psychology, 35, 294-302.

Hasher, L., Goldstein, D., & Toppino, T. (1977). Frequency and the conference of referential validity. Journal of Verbal Learning and Verbal Behavior, 16, 107-112.

Hasher, L., & Zacks, R. T. (1979). Automatic and effortful processes in

memory. Journal of Experimental Psychology: General, 108, 356-

388.

Hintzman, D. L., Nozawa, G., & Irmscher, M. (1982). Frequency as a

nonpropositional attribute of memory. Journal of Verbal Learning and Verbal Behavior, 21, 127-141.

Howell, W. C., & Burnett, S. (1978). Uncertainty measurement: A cognitive taxonomy. Organizational Behavior and Human Performance,

22, 45-68.

Johnson-Laird, P. N. (1983). Mental models. Cambridge, MA: Harvard

University Press.

Juslin, P. (1991). Well-calibrated general knowledge: An ecological induc-

tive approach to realism of confidence. Manuscript submitted for pub-

lication. Uppsala, Sweden.

Kadane, J. B., & Lichtenstein, S. (1982). A subjectivist view of calibration

(Rep. No. 82-86). Eugene, OR: Decision Research.

Kahneman, D., &Tversky, A. (1973). On the psychology of prediction.

Psychological Review, 80, 237-251.

Kahneman, D, & Tversky, A. (1982). On the study of statistical intu-

itions. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment

under uncertainty: Heuristics and biases (pp. 493-508). Cambridge,

England: Cambridge University Press.

Keren, G. (1987). Facing uncertainty in the game of bridge: A calibra-

tion study. Organizational Behavior and Human Decision Processes,

39, 98-114.

Keren, G. (1988). On the ability of monitoring non-veridical percep-

tions and uncertain knowledge: Some calibration studies. Acta Psychologica, 67, 95-119.

Koriat, A., Lichtenstein, S., & Fischhoff, B. (1980). Reasons for confidence. Journal of Experimental Psychology: Human Learning and Memory, 6, 107-118.
Kyburg, H. E. (1983). Rational belief. Behavioral and Brain Sciences, 6, 231-273.

Lad, F. (1984). The calibration question. British Journal of the Philo-

sophy of Science, 35, 213-221.

Leary, D. E. (1987). From act psychology to probabilistic functiona-

lism: The place of Egon Brunswik in the history of psychology. In

M. G. Ash & W R. Woodward (Eds.), Psychology in twentieth-century

thought and society (pp. 115-142). Cambridge, England: Cambridge

University Press.

Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also

know more about how much they know? The calibration of probabil-

ity judgments. Organizational Behavior and Human Performance,

20,159-183.

Lichtenstein, S., & Fischhoff, B. (1981). The effects of gender and in-

struction on calibration (Tech. Rep. No. PTR-1092-81-7). Eugene,

OR: Decision Research.

Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of

probabilities: The state of the art to 1980. In D. Kahneman, P. Slovic, &

A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases

(pp. 306-334). Cambridge, England: Cambridge University Press.

MacGregor, D.,& Slovic, P. (1986). Perceived acceptabilityofrisk analy-

sis as a decision-making approach. Risk Analysis, 6, 245-256.

May, R. S. (1986). Overconfidence as a result of incomplete and wrong

knowledge. In R. W. Scholz (Ed.), Current issues in West German

decision research (pp. 13-30). Frankfurt am Main, Germany: Lang.

May, R. S. (1987). Realismus von subjektiven Wahrscheinlichkeiten: Eine

kognitionspsychologische Analyse inferentieller Prozesse beim Over-

confidence-Phänomen [Calibration of subjective probabilities: A cognitive analysis of inference processes in overconfidence]. Frankfurt,

Federal Republic of Germany: Lang.

Mayseless, O., & Kruglanski, A. W. (1987). What makes you so sure?

Effects of epistemic motivations on judgmental confidence. Organi-zational Behavior and Human Decision Processes, 39,162-183.

Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood

Cliffs, NJ: Prentice-Hall.

Neyman, J. (1977). Frequentist probability and frequentist statistics.

Synthese, 36, 97-131.

Nisbett, R. E., & Borgida, E. (1975). Attribution and the psychology of

prediction. Journal of Personality and Social Psychology, 32, 932-

943.

Parducci, A. (1965). Category judgment: A range-frequency model. Psychological Review, 72, 407-418.



Ronis, D. L., & Yates, J. F. (1987). Components of probability judgmentaccuracy: Individual consistency and effects of subject matter and

assessment method. Organizational Behavior and Human Decision

Processes, 40,193-218.

Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization (pp. 27-48). Hillsdale, NJ: Erlbaum.

Schönemann, P. H. (1983). Some theory and results for metrics for bounded response scales. Journal of Mathematical Psychology, 27,

311-324.

Shafer, G. (1978). Non-additive probabilities in the work of Bernoulli and Lambert. Archive for the History of Exact Sciences, 19, 309-370.

Sivyer, M., & Finlay, EH (1982). Perceived duration of auditory se-

quences. Journal of General Psychology, 107, 209-217.

Statistisches Bundesamt. (1986). Statistisches Jahrbuch 1986 für die Bundesrepublik Deutschland [Statistical yearbook 1986 for the Federal Republic of Germany]. Stuttgart, Germany: Kohlhammer.

Stegmüller, W. (1973). Probleme und Resultate der Wissenschaftstheorie und Analytischen Philosophie. Bd. IV: Personelle und Statistische Wahrscheinlichkeit. Teil E. [Problems and results of philosophy of science and analytical philosophy. Vol. IV: Personal and statistical probability. Part E]. Berlin, Federal Republic of Germany: Springer.

Tanner, W. P., Jr., & Swets, J. A. (1954). A decision-making theory of visual detection. Psychological Review, 61, 401-409.

Teigen, K. H. (1974). Overestimation of subjective probabilities. Scandinavian Journal of Psychology, 15, 56-62.

Tversky, A, & Kahneman, D. (1983). Extensional versus intuitive rea-

soning: The conjunction fallacy in probability judgment. Psychologi-cal Review, 90, 293-315.

von Mises, R. (1928). Wahrscheinlichkeit, Statistik und Wahrheit. Ber-

lin, Germany: Springer. (Translated and reprinted as Probability, statistics, and truth. New York: Dover, 1957)

von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. Cambridge, England: Cambridge University Press.

Wagenaar, W. A., & Keren, G. B. (1985). Calibration of probability assess-

people. Organizational Behavior and Human Decision Processes, 36,406-416.

Wells, G. L., & Harvey, J. H. (1977). Do people use consensus information in making causal attributions? Journal of Personality and Social Psychology, 35, 279-293.
Zacks, R. T., Hasher, L., & Sanft, H. (1982). Automatic encoding of event frequency: Further findings. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8, 106-116.

Zimmer, A. C. (1983). Verbal vs. numerical processing by subjective probabilities. In R. W. Scholz (Ed.), Decision making under uncertainty (pp. 159-182). Amsterdam: North-Holland.
Zimmer, A. C. (1986). What uncertainty judgments can tell about the underlying subjective probabilities. Uncertainty in Artificial Intelligence, 249-258.

Received March 12, 1990
Revision received December 18, 1990
Accepted December 30, 1990
