Psychological Review
1991, Vol. 98, No. 4, 506-528
Copyright 1991 by the American Psychological Association, Inc.
0033-295X/91/$3.00
Probabilistic Mental Models: A Brunswikian Theory of Confidence
Gerd Gigerenzer
Center for Advanced Study in the Behavioral Sciences, Stanford, California

Ulrich Hoffrage and Heinz Kleinbölting
University of Salzburg, Salzburg, Austria
Research on people's confidence in their general knowledge has to date produced two fairly stable effects, many inconsistent results, and no comprehensive theory. We propose such a comprehensive framework, the theory of probabilistic mental models (PMM theory). The theory (a) explains both the overconfidence effect (mean confidence is higher than percentage of answers correct) and the hard-easy effect (overconfidence increases with item difficulty) reported in the literature and (b) predicts conditions under which both effects appear, disappear, or invert. In addition, (c) it predicts a new phenomenon, the confidence-frequency effect, a systematic difference between a judgment of confidence in a single event (i.e., that any given answer is correct) and a judgment of the frequency of correct answers in the long run. Two experiments are reported that support PMM theory by confirming these predictions, and several apparent anomalies reported in the literature are explained and integrated into the present framework.
Do people think they know more than they really do? In the
last 15 years, cognitive psychologists have amassed a large and
apparently damning body of experimental evidence on over-
confidence in knowledge, evidence that is in turn part of an
even larger and more damning literature on so-called cognitive
biases. The cognitive bias research claims that people are natu-
rally prone to making mistakes in reasoning and memory, in-
cluding the mistake of overestimating their knowledge. In this
article, we propose a new theoretical model for confidence in
knowledge based on the more charitable assumption that peo-
ple are good judges of the reliability of their knowledge, pro-
vided that the knowledge is representatively sampled from a
specified reference class. We claim that this model both pre-
dicts new experimental results (that we have tested) and ex-
plains a wide range of extant experimental findings on confi-
dence, including some perplexing inconsistencies.
Moreover, it is the first theoretical framework to integrate the
two most striking and stable effects that have emerged from
confidence studies—the overconfidence effect and the hard-
easy effect—and to specify the conditions under which these
effects can be made to appear, disappear, and even invert. In
most recent studies (including our own, reported herein), sub-
This article was written while Gerd Gigerenzer was a fellow at the Center for Advanced Study in the Behavioral Sciences, Stanford, California. We are grateful for financial support provided by the Spencer Foundation and the Deutsche Forschungsgemeinschaft (DFG 170/2-1).
We thank Leda Cosmides, Lorraine Daston, Baruch Fischhoff, Jennifer Freyd, Kenneth Hammond, Wolfgang Hell, Sarah Lichtenstein, Kathleen Much, John Tooby, Amos Tversky, and an anonymous reviewer for helpful comments on earlier versions of this article.
Correspondence concerning this article should be addressed to Gerd Gigerenzer, who is now at the Institut für Psychologie, Hellbrunnerstrasse 34, Universität Salzburg, A-5020 Salzburg, Austria.
jects are asked to choose between two alternatives for each of a
series of general-knowledge questions. Here is a typical exam-
ple: "Which city has more inhabitants? (a) Hyderabad or (b)
Islamabad." Subjects choose what they believe to be the correct
answer and then are directed to specify their degree of confi-
dence (usually on a 50%-100% scale) that their answer is indeed
correct. After the subjects answer many questions of this sort,
the responses are sorted by confidence level, and the relative
frequencies of correct answers in each confidence category are
calculated. The overconfidence effect occurs when the confi-
dence judgments are larger than the relative frequencies of the
correct answers; the hard-easy effect occurs when the degree of
overconfidence increases with the difficulty of the questions,
where the difficulty is measured by the percentage of correct
answers.
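As a minimal illustration (ours, not from the original article), the scoring procedure just described can be sketched in Python: answers are sorted by confidence level, the hit rate per level is computed, and overconfidence is mean confidence minus overall percentage correct. The response data below are invented:

```python
from collections import defaultdict

def calibration(responses):
    """Sort (confidence, correct) pairs by confidence level and compute
    the relative frequency of correct answers in each category."""
    by_level = defaultdict(list)
    for conf, correct in responses:
        by_level[conf].append(correct)
    return {conf: sum(hits) / len(hits) for conf, hits in sorted(by_level.items())}

def overconfidence(responses):
    """Mean confidence minus overall percentage of correct answers."""
    mean_conf = sum(c for c, _ in responses) / len(responses)
    pct_correct = sum(k for _, k in responses) / len(responses)
    return mean_conf - pct_correct

# Invented responses: (confidence, was the chosen answer correct?)
responses = [(1.0, True), (1.0, False), (0.9, True), (0.9, False),
             (0.7, True), (0.7, False), (0.5, True), (0.5, False)]
print(calibration(responses))
print(overconfidence(responses))  # mean confidence .775 vs. 50% correct: about .275
```

In this toy data set every confidence level has a hit rate of only 50%, so the overconfidence effect appears at each level of the calibration curve.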
Both effects seem to be stable. Fischhoff (1982) reviewed the
attempts to eliminate overconfidence by numerous "debiasing
methods," such as giving rewards, clarifying instructions, warn-
ing subjects in advance about the problem, and using better
response modes—all to no avail. He concluded that these ma-
nipulations "have so far proven relatively ineffective," and that
overconfidence was "moderately robust" (p. 440). von Winter-
feldt and Edwards (1986, p. 539) agreed that "overconfidence is
a reliable, reproducible finding." Yet these robust phenomena
still await a theory. In particular, we lack a comprehensive theo-
retical framework that explains both phenomena, as well as the
various exceptions reported in the literature, and integrates the
several local explanatory attempts already advanced. That is the
aim of this article. It consists of four parts: (a) an exposition of
the proposed theory of probabilistic mental models (PMM
theory), including predictions of new experimental findings
based on the theory; (b) a report of our experimental tests con-
firming these predictions; (c) an explanation of apparent anoma-
lies in previous experimental results, by means of PMMs; and
(d) a concluding discussion.
PMM Theory
This theory deals with spontaneous confidence—that is, with an immediate reaction, not the product of long reflection. Figure 1 shows a flow chart of the processes that generate confidence judgments in two-alternative general-knowledge tasks.¹ There are two strategies. When presented with a two-alternative confidence task, the subject first attempts to construct what we call a local mental model (local MM) of the task. This is a solution by memory and elementary logical operations. If this fails, a PMM is constructed that goes beyond the structure of the task in using probabilistic information from a natural environment.
For convenience, we illustrate the theory using a problem from the following experiments: "Which city has more inhabitants? (a) Heidelberg or (b) Bonn." As explained earlier, the subjects' task is to choose a or b and to give a numerical judgment of their confidence (that the answer chosen is correct).
Local MM
We assume that the mind first attempts a direct solution that could generate certain knowledge by constructing a local MM. For instance, a subject may recall from memory that Heidelberg has a population between 100,000 and 200,000, whereas Bonn has more than 290,000 inhabitants. This is already sufficient for the answer "Bonn" and a confidence judgment of 100%. In general, a local MM can be successfully constructed if (a) precise figures can be retrieved from memory for both alternatives, (b) intervals that do not overlap can be retrieved, or (c) elementary logical operations, such as the method of exclusion, can compensate for missing knowledge. Figure 2 illustrates a successful local MM for the previous example. Now consider a task where the target variable is not quantitative (such as the number of inhabitants) but is qualitative: "If you see the nationality letter P on a car, is it from Poland or Portugal?" Here, either direct memory about the correct answer or the method of exclusion is sufficient to construct a local MM. The latter is illustrated by a subject reasoning "Since I know that Poland has PL it must be Portugal" (Allwood & Montgomery, 1987, p. 370).
The structure of the task must be examined to define more generally what is referred to as a local MM. The task consists of two objects, a and b (alternatives), and a target variable t. First, a local MM of this task is local; that is, only the two alternatives are taken into account, and no reference class of objects is constructed (see the following discussion). Second, it is direct; that is, it contains only the target variable (e.g., number of inhabitants), and no probability cues are used. Third, no inferences besides elementary operations of deductive logic (such as exclusion) occur. Finally, if the search is successful, the confidence in the knowledge produced is evaluated as certain. In these respects, our concept of a local MM is similar to what Johnson-Laird (1983, pp. 134-142) called a "mental model" in syllogistic inference.
A local MM simply matches the structure of the task; there is no use of the probability structure of an environment and, consequently, no frame for inductive inference as in a PMM. Because memory can fail, the "certain" knowledge produced can sometimes be incorrect. These failures contribute to the amount of overconfidence to be found in 100%-confident judgments.
PMM
Local MMs are of limited success in general-knowledge tasks² and in most natural environments, although they seem to be sufficient for solving some syllogisms and other problems of deductive logic (see Johnson-Laird, 1983). If no local MM can be activated, it is assumed that a PMM is constructed next. A PMM solves the task by inductive inference, and it does so by putting the specific task into a larger context. A PMM connects the specific structure of the task with a probability structure of a corresponding natural environment (stored in long-term memory). In our example, a natural environment could be the class of all cities in Germany with a set of variables defined on this class, such as the number of inhabitants. This task selects the number of inhabitants as the target and the variables that covary with this target as the cues.
A PMM is different from a local MM in several respects. First, it contains a reference class of objects that includes the objects a and b. Second, it uses a network of variables in addition to the target variable for indirect inference. Thus, it is neither local nor direct. These two features also change the third and fourth aspects of a local MM. Probabilistic inference is part of the cognitive process, and uncertainty is part of the outcome.
Reference Class
We use Brunswik's (1943, p. 257) term reference class to define the class of objects or events that a PMM contains. In our example, the reference class "all cities in Germany" may be generated. To generate a reference class means to generate a set of objects known from a person's natural environment that contains objects a and b.
The reference class determines which cues can function as probability cues for the target variable and what their cue validities are. For instance, a valid cue in the reference class "all cities in Germany" would be the soccer-team cue; that is, whether a city's soccer team plays in the German soccer Bundesliga, in which the 18 best teams compete. Cities with more inhabitants are more likely to have a team in the Bundesliga. The soccer-team cue would not help in the Hyderabad-Islamabad task, which must be solved by a PMM containing a different reference class with different cues and cue validities.
Probability Cues
A PMM for a given task contains a reference class, a target variable, probability cues, and cue validities. A variable is a
¹ For convenience, the theory is presented here in its complete form, although parts of it were developed after Experiment 1 was performed. All those parts were subjected to an independent test in Experiment 2.
² Allwood and Montgomery (1987, pp. 369-370) estimated from verbal protocols that about 19% of their general-knowledge questions were solved by "full recognition," which seems to be equivalent to memory and elementary logical operations only.
Figure 1. Cognitive processes in solving a two-alternative general-knowledge task. (MM = mental model; PMM = probabilistic mental model.)
probability cue C_i (for a target variable in a reference class R) if the probability p(a) of a being correct is different from the conditional probability of a being correct, given that the values of a and b differ on C_i. If the cue is a binary variable such as the soccer-team cue, this condition can be stated as follows:

p(a) ≠ p(a | aC_ib; R),

where aC_ib signifies the relation of a and b on the cue C_i (e.g., a has a soccer team in the Bundesliga, but b does not) and p(a | aC_ib; R) is the cue validity of C_i in R.
Thus, cue validities are thought of as conditional probabilities, following Rosch (1978) rather than Brunswik (1955), who defined his "cue utilizations" as Pearson correlations. Conditional probabilities need not be symmetric as correlations are. This allows the cue to be a better predictor for the target than the target is for the cue, or vice versa. Cue validity is a concept in the PMM, whereas the corresponding concept in the environment is ecological validity (Brunswik, 1955), which is the true relative frequency of any city having more inhabitants than any other one in R if aC_ib. For example, consider the reference class all cities in Germany with more than 100,000 inhabitants. The ecological validity of the soccer-team cue here is .91 (calculated for 1988/1989 for what then was West Germany). That is, if one checked all pairs in which one city a has a team in the Bundesliga but the other city b does not, one would find that in 91% of these cases city a has more inhabitants.

Figure 2. Local mental model of a two-alternative general-knowledge task.
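An ecological validity of this kind is simply a relative frequency over discriminating pairs, and the computation can be sketched as follows. The city data are invented for illustration; the article's .91 figure was computed over all West German cities above 100,000 inhabitants:

```python
from itertools import permutations

# Invented illustration data: name -> (inhabitants, has a Bundesliga team?)
cities = {
    "A": (600_000, True),
    "B": (250_000, False),
    "C": (1_200_000, True),
    "D": (110_000, False),
    "E": (700_000, False),
}

def ecological_validity(objects):
    """Among all ordered pairs (a, b) in which the cue discriminates
    (a has a team, b does not), return the relative frequency with
    which a is in fact the larger city."""
    hits = total = 0
    for a, b in permutations(objects.values(), 2):
        if a[1] and not b[1]:      # cue discriminates: aC_ib holds
            total += 1
            hits += a[0] > b[0]    # a really is the larger city
    return hits / total

print(ecological_validity(cities))  # 5 of 6 discriminating pairs: about .83
```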
Vicarious Functioning
Probability cues are generated, tested, and if possible, activated. We assume that the order in which cues are generated is not random; in particular, we assume that the order reflects the hierarchy of cue validities. For the reference class all cities in Germany, the following cues are examples that can be generated: (a) the soccer-team cue; (b) whether one city is a state capital and the other is not (state capital cue); (c) whether one city is located in the Ruhrgebiet, the industrial center of Germany, and the other in largely rural Bavaria (industrial cue); (d) whether the letter code that identifies a city on a license plate is shorter for one city than for the other (large cities are usually abbreviated by only one letter, smaller cities by two or three; license plate cue); and (e) whether one has heard of one city and not of the other (familiarity cue). Consider now the Heidelberg-Bonn problem again. The first probability cue is generated and tested to see whether it can be activated for that problem. Because neither of the two cities has a team in the Bundesliga, the first cue does not work.
In general, with a binary cue and the possibility that the subject has no knowledge, there are nine possibilities (see Figure 3). In only two of these can a cue be activated. In all other cases, the cue is useless (although one could further distinguish between the four known-unknown cases and the three remaining cases). If a cue cannot be activated, then a further cue is generated and tested. In the Heidelberg-Bonn task, none of the five cues cited earlier can in fact be activated. Finally, one cue may be generated that can be activated, such as whether one city is the capital of the country and the other is not (capital cue). This cue has a small probability of being activated—a small activation rate in R (because it applies only to pairs that include Bonn)—and it does not have a particularly high cue validity in R because it is well-known that Bonn is not exactly London or Paris.

Figure 3. Two conditions in which a cue can be activated.
The Heidelberg-Bonn problem illustrates that probability cues may have small activation rates in R, and as a consequence, several cues may have to be generated and tested before one is found that can be activated. The capital cue that can be activated for the Heidelberg-Bonn comparison may fail for the next problem, for instance a Heidelberg-Göttingen comparison. Cues can substitute for one another from problem to problem, a process that Brunswik (1955) called "vicarious functioning."
End of Cue Generation and Testing Cycle
If (a) the number of problems is large or other kinds of time pressure apply and (b) the activation rate of cues is rather small, then one can assume that the cue generation and testing cycle ends after the first cue that can be activated has been found. Both conditions seem to be typical for general-knowledge questions. For instance, even when subjects were explicitly instructed to produce all possible reasons for and against each alternative, they generated only about three on the average and four at most (Koriat, Lichtenstein, & Fischhoff, 1980). If no cue can be activated, we assume that choice is made randomly, and "confidence 50%" is chosen.
Choice of Answer and Confidence Judgment
Choice of answer and confidence judgment are determined by the cue validity. Choice follows the rule:

choose a if p(a | aC_ib; R) > p(b | aC_ib; R).

If a is chosen, the confidence that a is correct is given by the cue validity:

p(a | aC_ib; R).
Note that the assumption that confidence equals cue validity is not arbitrary; it is both rational and simple in the sense that good calibration is to be expected if cue validities correspond to ecological validities. This holds true even if only one cue is activated.
Thus, choice and confidence are inferred from the same activated cue. Both are expressions of the same conditional probability. Therefore, they need not be generated in the temporal sequence choice followed by confidence. The latter is, of course, typical for actual judgments and often enforced by the instructions in confidence studies.
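The process described in the preceding sections (generate cues in descending order of validity, test whether each can be activated, and read both choice and confidence off the first activated cue) can be sketched as follows. The cue validities and city facts are invented for illustration; they are not the authors' estimates:

```python
def pmm_answer(a, b, cues):
    """cues: list of (validity, lookup) in descending cue validity,
    where lookup(city) returns True, False, or None (unknown).
    Returns (choice, confidence)."""
    for validity, lookup in cues:        # cue generation and testing cycle
        va, vb = lookup(a), lookup(b)
        if va is True and vb is False:   # one of the two activatable cells
            return a, validity           # confidence equals cue validity
        if va is False and vb is True:
            return b, validity
    return a, 0.5                        # no cue activated: guess (here simply a), say "50%"

# Invented knowledge base for three cues.
bundesliga = {"Heidelberg": False, "Bonn": False, "Munich": True}
state_capital = {"Heidelberg": False, "Bonn": False, "Munich": True}
capital = {"Heidelberg": False, "Bonn": True, "Munich": False}

cues = [
    (0.91, bundesliga.get),     # soccer-team cue (highest validity)
    (0.85, state_capital.get),  # state-capital cue (validity invented)
    (0.70, capital.get),        # capital cue (low validity: Bonn is no London)
]

print(pmm_answer("Heidelberg", "Bonn", cues))  # first two cues fail; capital cue activates
```

For the Heidelberg-Bonn pair the first two cues cannot be activated, so the cycle continues until the capital cue discriminates, yielding the choice "Bonn" with the (modest) confidence of that cue's validity.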
Confidence in the Long Run and Confidence in Single Events
Until now, only confidence in single events—such as the answer "Bonn" is correct—has been discussed. Confidence in one's knowledge can also be expressed with respect to sequences of answers or events, such as "How many of the last 50 questions do you think you answered correctly?" This distinction is parallel to that between probabilities of single events and
relative frequencies in the long run—a distinction that is funda-
mental to all discussions on the meaning of probability (see
Gigerenzer et al., 1989). Probabilities of single events (confi-
dences) and relative frequencies are not the same for many
schools of probability, and we argue that they are not evaluated
by the same cognitive processes either.
Consider judgments of frequency. General-knowledge tasks
that involve a judgment of the frequency of correct answers
(frequency tasks) can rarely be answered by constructing a local
MM. The structure of the task contains one sequence of N
questions and answers, and the number of correct answers is the
target variable. Only limiting cases, such as small N (i.e., if only a
few questions are asked) combined with the belief that all an-
swers were correct, may allow one to solve this task by a local
MM. Again, to construct a local MM of the task means that the
mental model consists of only the local sequence of total N
answers (no reference class), and because one attempts to solve
the task by direct access to memory about the target variable,
no network of probability cues is constructed.
Similarly, a PMM of a frequency task is different from a
PMM of a confidence task. A confidence task about city size in
Germany has "cities in Germany" as a reference class; however,
a task that involves judgments of frequencies of correct answers
in a series of N questions about city size has a different reference
class: Its reference class will contain series of similar questions
in similar testing situations. Because the target variable also
differs (number of correct answers instead of number of inhabitants), the PMM of a frequency task will also contain different
cues and cue validities. For instance, base rates of performance
in earlier general-knowledge or similar testing situations could
serve as a probability cue for the target variable. Again, our
basic assumption is that a PMM connects the structure of the
task with a known structure of the subject's environment.
Table 1 summarizes the differences between PMMs that are
implied by the two different tasks. Note that in our account,
both confidences in a single event and judgments of frequency
are explained by reference to experienced frequencies. How-
ever, these frequencies relate to different target variables and
reference classes. We use this assumption to predict systematic
differences between these kinds of judgments.
Adaptive PMMs and Representative Sampling
A PMM is an inductive device that uses the "normal" life conditions in known environments as the basis for induction.
How well does the structure of probability cues defined on R in
a PMM represent the actual structure of probability cues in the
environment? This question is also known as that of "proper
cognitive adjustment" (Brunswik, 1964, p. 22). If the hierarchy
of cues and their validities corresponds to that of the ecological
validities, then the PMM is well adapted to a known environ-
ment. In Brunswik's view, cue validities are learned by observ-
ing the frequencies of co-occurrences in an environment.
A large literature exists that suggests that (a) memory is often
(but not always) excellent in storing frequency information
from various environments and (b) the registering of event oc-
currences for frequency judgments is a fairly automatic cogni-
tive process requiring very little attention or conscious effort
(e.g., Gigerenzer, 1984; Hasher, Goldstein, & Toppino, 1977; Howell & Burnett, 1978; Zacks, Hasher, & Sanft, 1982). Hasher and Zacks (1979) concluded that frequency of occurrence, spatial location, time, and word meaning are among the few
aspects of the environment that are encoded automatically and
that encoding of frequency information is "automatic at least in
part because of innate factors" (p. 360). In addition, Hintzman,
Nozawa, and Irmscher (1982) proposed that frequencies are stored in memory in a nonnumerical analog mode.
Whatever the mechanism of frequency encoding, we use the
following assumption for deriving our predictions: If subjects
had repeated experience with a reference class, a target variable,
and cues in their environment, we assume that cue validities
correspond well to ecological validities. (This holds true for the
average in a group of subjects, but individual idiosyncrasies in
learning the frequency structure of the environment may occur.) This is a bold assumption made in ignorance of potential deviations between specific cue validities and ecological validities. If such deviations existed and were known, predictions by PMM theory could be improved. The assumption, however, derives support from both the literature on automatic frequency processing and a large body of neo-Brunswikian research on the correspondence between ecological validities and cue utilization (the latter of which corresponds to our cue validities; e.g., Arkes & Hammond, 1986; K. Armelius, 1979; Brehmer & Joyce, 1988; MacGregor & Slovic, 1986).

Note that this adaptiveness assumption does not preclude that individuals (as well as the average subject) err. Errors can occur even if a PMM is highly adapted to a given environment. For instance, if an environment is changing or is changed in the laboratory by an experimenter, an otherwise well-adapted PMM may be suboptimal in a predictable way.
Table 1
Probabilistic Mental Models for Confidence Task Versus Frequency Task: Differences Between Target Variables, Reference Classes, and Probability Cues

PMM                 Confidence task                  Frequency task
Target variable     Number of inhabitants            Number of correct answers
Reference class     Cities in Germany                Sets of general-knowledge questions
                                                     in similar testing situations
Probability cues    For example, soccer-team cue     For example, base rates of previous
                    or state capital cue             performance or average confidence
                                                     in N answers

Note. For illustration, questions of the Heidelberg-Bonn type are used. PMM = probabilistic mental model.
Brunswik's notion of "representative sampling" is important
here. If a person experienced a representative sample of objects
from a reference class, one can expect his or her PMM to be
better adapted to an environment than if he or she happened to
experience a skewed, unrepresentative sample.
Representative sampling is also important in understanding
the relation between a PMM and the task. If a PMM is well
adapted, but the set of objects used in the task (questions) is not
representative of the reference class in the environment, perfor-
mance in tasks will be systematically suboptimal.
To avoid confusion with terms such as calibration, we will use
the term adaptation only when we are referring to the relation
between a PMM and a corresponding environment—not, how-
ever, for the relation between a PMM and a task.
Predictions
A concrete example can help motivate our first prediction.
Two of our colleagues, K and O, are eminent wine tasters. K
likes to make a gift of a bottle of wine from his cellar to friend
O, on the condition that O guesses what country or region the
grapes were grown in. Because O knows the relevant cues, O
can usually pick a region with some confidence. O also knows
that K sometimes selects a quite untypical exemplar from his ample wine cellar to test friend O's limits. Thus, for each individual wine, O can infer the probability that the grapes ripened
in, say, Portugal as opposed to South Africa, with considerable
confidence from his knowledge about cues. In the long run,
however, O nevertheless expects the relative frequency of
correct answers to be lower because K occasionally selects unusual items.

Consider tests of general knowledge, which share an important feature with the wine-tasting situation: Questions are selected to be somewhat difficult and sometimes misleading. This
practice is common and quite reasonable for testing people's
limits, as in the wine-tasting situation. Indeed, there is appar-
ently not a single study on confidence in knowledge where a
reference class has been defined and a representative (or ran-
dom) sample of general-knowledge questions has been drawn
from this population. For instance, consider the reference class
"metropolis" and the geographical north-south location as the
target variable. A question like "Which city is farther north? (a)
New York or (b) Rome" is likely to appear in a general-knowl-
edge test (almost everyone gets it wrong), whereas a compari-
son between Berlin and Rome is not.
The crucial point is that confidence and frequency judg-
ments refer to different kinds of reference classes. A set of ques-
tions can be representative with respect to one reference class
and, at the same time, selected with respect to the other class.
Thus, a set of 50 general-knowledge questions of the city type
may be representative for the reference class "sets of general-
knowledge questions" but not for the reference class "cities in
Germany" (because city pairs have been selected for being difficult or misleading). Asking for a confidence judgment summons up a PMM on the basis of the reference class "cities in Germany"; asking for a frequency judgment summons up a
PMM on the basis of the reference class "sets of general-knowl-
edge questions." The first prediction can now be stated.
1. Typical general-knowledge tasks elicit both overconfidence
and accurate frequency judgments. By "typical" general-knowl-
edge tasks we refer to a set of questions that is representative for
the reference class "sets of general-knowledge questions."
This prediction is derived in the following way: If (a) PMMs
for confidence tasks are well adapted to an environment con-
taining a reference class R (e.g., all cities in Germany) and (b) the
actual set of questions is not representative for R, but selected
for difficult pairs of cities, then confidence judgments exhibit
overconfidence. Condition A is part of our theory (the simplify-
ing assumption we just made), and Condition B is typical for
the general-knowledge questions used in studies on confidence
as well as in other testing situations.
If (a) PMMs for frequency-of-correct-answer tasks are well
adapted with respect to an environment containing a reference
class K (e.g., the set of all general-knowledge tests experienced
earlier), and (b) the actual set of questions is representative for
K, then frequency judgments are expected to be accurate.
Again, Condition A is part of our theory, and Condition B will
be realized in our experiments by using a typical set of general-
knowledge questions.
Taken together, the prediction is that the same person will
exhibit overconfidence when asked for the confidence that a
particular answer is correct and accurate estimates when asked
for a judgment of frequency of correct answers. This prediction
is shown by the two points on the left side of Figure 4. This
prediction cannot be derived from any of the previous accounts
of overconfidence.
Figure 4. Predicted differences between confidence and frequency judgments (confidence-frequency effect).

To introduce the second prediction, we return to the wine-tasting story. Assume that K changes his habit of selecting unusual wines from his wine cellar, and instead buys a representative sample of French red wines and lets O guess from what region they come. However, K does not tell O about the new
sampling technique. O's average confidence judgments will
now be close to the proportion of correct answers. In the long
run, O nevertheless expects the proportion of correct answers
to be less, still assuming the familiar testing situation in which
wines were selected, not randomly sampled. Thus, O's fre-
quency judgments will show underestimation.
Consider now a set of general-knowledge questions that is a
random sample from a defined reference class in the subject's
natural environment. We use the term natural environment to
denote a knowledge domain familiar to the subjects participat-
ing in the study. This is a necessary (although not sufficient)
condition to assume that PMMs are, on the average, well
adapted. In the experiments reported herein, we used West
German subjects and the reference class "all cities with more
than 100,000 inhabitants in West Germany." (The study was conducted before the unification of Germany.) The second prediction is about this situation:
2. If the set of general-knowledge tasks is randomly sampled
from a natural environment, we expect overconfidence to be zero,
but frequency judgments to exhibit underestimation. Derivation
is as before: If PMMs for confidence tasks are well adapted with
respect to R, and the actual set of questions is a representative
sample from R, then overconfidence is expected to disappear. If
PMMs for frequency-of-correct-answers tasks are well adapted
with respect to K, and the actual set of questions is not represen-
tative for R, then frequency judgments are expected to be un-
derestimations of true frequencies.
Again, this prediction cannot be derived from earlier accounts. Figure 4 shows Predictions 1 and 2. The predicted difference between confidence and frequency judgments is referred to as the confidence-frequency effect.
Testing these predictions also allows for testing the assump-
tion of well-adapted PMMs for the confidence task. Assume
that PMMs are not well adapted. Then a representative sample
of city questions should not generate zero overconfidence but
rather over- or underconfidence, depending on whether cue
validities overestimate or underestimate ecological validities.
Similarly, if PMMs for frequency judgments are not well
adapted, frequency judgments should deviate from true fre-
quencies in typical general-knowledge tasks. Independent of
the degree of adaptation, however, the confidence-frequency
effect should emerge, but the curves in Figure 4 would be transposed upward or downward.
We turn now to the standard way in which overconfidence
has been demonstrated in previous research, comparing confi-
dence levels with relative frequencies of correct answers at each
confidence level. This standard comparison runs into a con-
ceptual problem well-known in probability theory and statis-
tics: A discrepancy between subjective probabilities in single
events (i.e., the confidence that a particular answer is correct)
and relative frequencies in the long run is not a bias in the sense
of a violation of probability theory, as is clear from several
points of view within probability theory. For instance, for a
frequentist such as Richard von Mises (1928/1957), probability
theory is about frequencies (in the long run), not about single
events. According to this view, the common interpretation of
overconfidence as a bias is based on comparing apples with
oranges. What if that conceptual problem is avoided and, in-
stead, the relative frequency of correct answers in each confi-
dence category is compared with the estimated relative fre-
quency in each confidence category? PMM theory makes an
interesting prediction for this situation, following the same rea-
soning as for the frequency judgments in Predictions 1 and 2
(which were estimated frequencies of correct answers in a series
of N questions, whereas estimated relative frequencies in each
confidence category are the concern here):
3. Comparing estimated relative frequencies with true relative
frequencies of correct answers makes overestimation disappear.
More precisely, if the set of general-knowledge questions is se-
lected, over- or underestimation is expected to be zero; if the set
is randomly sampled, underestimation is expected. Thus,
PMM theory predicts that the distinction between confidence
and relative frequency is psychologically real, in the sense that
subjects do not believe that a confidence judgment of X% im-
plies a relative frequency of X%, and vice versa. We know of no
study on overconfidence that has investigated this issue. Most
have assumed instead that there is, psychologically, no differ-
ence.
Prediction 4 concerns the hard-easy effect, which says that
overconfidence increases when questions get more difficult
(e.g., Lichtenstein & Fischhoff, 1977). The effect refers to confi-
dence judgments only, not to frequency judgments. On our ac-
count, the hard-easy effect is not simply a function of difficulty.
Rather, it is a function of difficulty and a separate dimension,
selected versus representative sampling. (Note that the terms
hard and easy refer to the relative difficulty of two samples of
items, whereas the terms selected and representative refer to the
relation between one sample and a reference class in the per-
son's environment.) PMM theory specifies conditions under
which the hard-easy effect occurs, disappears, and is reversed.
A reversed hard-easy effect means that overconfidence de-
creases when questions are more difficult.
In Figure 5, the line descending from H to E represents a
hard-easy effect: Overconfidence in the hard set is larger than
in the easy set. The important distinction (in addition to hard
vs. easy) is whether a set was obtained by representative sam-
pling or was selected. For instance, assume that PMMs are well
adapted and that two sets of tasks differing in percentage
correct (i.e., in difficulty) are both representative samples from
their respective reference classes. In this case, one would expect
all points to be on the horizontal zero-overconfidence line in
Figure 5 and the hard-easy effect to be zero. More generally:
4. If two sets, hard and easy, are generated by the same sam-
pling process (representative sampling or same deviation from
representative), the hard-easy effect is expected to be zero. If sam-
pling deviates in both the hard and the easy set equally from
representative sampling, points will lie on a horizontal line par-
allel to the zero-overconfidence line.
Now consider the case that the easy set is selected from a
corresponding reference class (e.g., general-knowledge questions), but the hard set is a representative sample from another
reference class (denoted as H in Figure 5). One then would
predict a reversal of the hard-easy effect, as illustrated in Figure
5 by the double line from E to H.
5. If there are two sets, one is a representative sample from a
reference class in a natural environment, the other is selected from
another reference class for being difficult, but the representative
Figure 5. Predicted reversal of the hard-easy effect. (H = hard; E = easy.)
set is harder than the selected set; then the hard-easy effect is
reversed.
In the next section, Predictions 1, 2, and 3 are tested in two experiments; in the Explaining Anomalies in the Literature section, Predictions 4 and 5 are checked against results in the literature.
Experiment 1
Method
Two sets of questions were used, which we refer to as the representative and the selected set. The representative set was determined in the following way. We used as a reference class in a natural environment (an environment known to our subjects) the set of all cities in West Germany with more than 100,000 inhabitants. There were 65 cities (Statistisches Bundesamt, 1986). From this reference class, a random sample of 25 cities was drawn, and all pairs of cities in the random sample were used in a complete paired comparison to give 300 pairs. No selection occurred. The target variable was the number of inhabitants, and the 300 questions were of the following kind: "Which city has more inhabitants? (a) Solingen or (b) Heidelberg." We chose city questions for two reasons. First, and most important, this content domain allowed for a precise definition of a reference class in a natural environment and for random sampling from this reference class. The second reason was comparability. City questions have been used in earlier studies on overconfidence (e.g., Keren, 1988; May, 1987).
In addition to the representative set, a typical set of general-knowledge questions, as in previous studies, was used. This selected set of 50 general-knowledge questions was taken from an earlier study (Angele et al., 1982). Two examples are "Who was born first? (a) Buddha or (b) Aristotle" and "When was the zipper invented? (a) before 1920 or (b) after 1920."
After each answer, the subject gave a confidence judgment (that this particular answer was correct). Two kinds of frequency judgments were used. First, after each block of 50 questions, the subject estimated the number of correct answers among the 50 answers given. Because there were 350 questions, every subject gave seven estimates of the number of correct answers. Second, after the subjects answered all questions, they were given an enlarged copy of the confidence scale used throughout the experiment and were asked for the following frequency judgment: "How many of the answers that you classified into a certain confidence category are correct? Please indicate for every category your estimated relative frequency of correct answers."
In Experiment 1, we also introduced two of the standard manipulations in the literature. The first was to inform and warn half of our subjects of the overconfidence effect, and the second was to offer half of each group a monetary incentive for good performance. Both are among a list of "debiasing" methods known to be relatively ineffective (Fischhoff, 1982), and both contributed to the view that overconfidence is a robust phenomenon. If PMM theory is correct, the magnitude of effects resulting from the two manipulations (confidence versus frequency judgment and selected versus representative sampling) should be much larger than those resulting from the "debiasing" manipulations.
Subjects. Subjects were 80 students (43 men and 37 women) at the
University of Konstanz who were paid for participation. Eighty-five percent of them grew up in the state of Baden-Württemberg, so the group was fairly homogeneous (knowledge about city populations often depends on the rater's geographical location). Subjects were tested in small groups of a maximum of 12 persons.
Design and procedure. This was a 2 × 2 × 2 design with representative-selected set varied within subjects, and information-no information about overconfidence and monetary incentive-no incentive as independent variables varied between subjects. Half of the subjects answered the representative set first; the other half, the selected set. Order of questions was determined randomly in both sets.
The confidence scale consisted of seven categories: 50%, 51%-60%, 61%-70%, 71%-80%, 81%-90%, 91%-99%, and 100% confident. The 50%- and 100%-confidence values were introduced as separate categories because previous research showed that subjects often tend to use these particular values. Subjects were told first to mark the alternative that seemed to be the correct one, and then to indicate with a second cross their confidence that the answer was correct. If they only guessed, they should cross the 50% category; if they were absolutely certain, they should cross the 100% category. We explained that one of the alternatives was always correct. In the information condition, subjects received the following information: "Most earlier studies found a systematic tendency to overestimate one's knowledge; that is, there were many fewer answers correct than one would expect from the confidence ratings given. Please keep this warning in mind." In the incentive condition, subjects were promised 20 German marks (or a bottle of French champagne), in addition to the payment that everyone received (7.50 marks), for the best performance in the group.
To summarize, 350 questions were presented, with a confidence judgment after each question, a frequency judgment after every 50 questions, and a judgment of relative frequencies of correct answers in each confidence category at the end.
For comparison with the literature on calibration, we used the following measure:

over-/underconfidence = (1/n) Σ nᵢ(pᵢ - fᵢ) = p̄ - f̄,  (sum over i = 1, . . . , I)

where n is the total number of answers, nᵢ is the number of times the confidence judgment pᵢ was used, and fᵢ is the relative frequency of correct answers for all answers assigned confidence pᵢ. I is the number of different confidence categories used (I = 7), and p̄ and f̄ are the overall mean confidence judgment and percentage correct, respectively. A positive difference is called overconfidence. For convenience, we report over- and underconfidence in percentages (×100).
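The computation of this measure can be sketched in a few lines of Python (an illustration of the formula above, not code from the study; the example data are hypothetical):

```python
# Sketch of the over-/underconfidence measure: the weighted mean of
# (confidence - relative frequency correct) across confidence categories,
# which algebraically equals mean confidence minus proportion correct.

def over_underconfidence(confidences, correct):
    """confidences: one confidence judgment per answer (0.5 .. 1.0);
    correct: 1 if that answer was right, 0 otherwise."""
    n = len(confidences)
    # group answers by confidence category: category -> (n_i, hits)
    categories = {}
    for p, c in zip(confidences, correct):
        n_i, hits = categories.get(p, (0, 0))
        categories[p] = (n_i + 1, hits + c)
    # (1/n) * sum over categories of n_i * (p_i - f_i)
    return sum(n_i * (p - hits / n_i)
               for p, (n_i, hits) in categories.items()) / n

# Hypothetical data: six answers, four of them correct
confs = [1.0, 1.0, 0.85, 0.85, 0.65, 0.5]
hits = [1, 0, 1, 0, 1, 1]
score = over_underconfidence(confs, hits)  # positive: overconfidence
```

The shortcut p̄ - f̄ gives the same value: here mean confidence is about 80.8% against 66.7% correct, an overconfidence of roughly 14 percentage points.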
Results
Prediction 1. PMM theory predicts that in the selected set
(general-knowledge questions), people show overestimation in
confidence judgments (overconfidence) and, simultaneously, ac-
curate frequency judgments.
The open-circle curve in Figure 6 shows the relation between
judgments of confidence and true relative frequency of correct
answers in the selected set—that is, the set of mixed general-
knowledge questions.3 The relative frequency of correct an-
swers (averaged over all subjects) was 72.4% in the 100%-confi-
dence category, 66.3% in the 95% category, 58.0% in the 85%
category, and so on. The curve is far below the diagonal (calibra-
tion curve) and similar to the curves reported by Lichtenstein,
Fischhoff, and Phillips (1982, Figure 2). It replicates and demonstrates the well-known overconfidence effect. Percentage
Figure 6. Calibration curves for the selected set (open circles), representative set (black squares), and matched set (open squares).
correct was 52.9, mean confidence was 66.7, and overconfi-
dence was 13.8.
Subjects' frequency judgments, however, are fairly accurate,
as Table 2 (last row) shows. Each entry is averaged over the 20
subjects in each condition. For instance, the figure -1.75 means that, on average, subjects in this condition underestimated the true number of correct answers by 1.75. Averaged
across the four conditions, we get -1.2, which means that sub-
jects missed the true frequency by an average of only about 1
correct answer in the set of 50 questions. Quite accurate frequency judgments coexist with overconfidence. The magnitude of the resulting confidence-frequency effect is shown in Figure 7 (left side). PMM theory predicts this systematic difference between confidence and frequency judgments, within the same person and the same general-knowledge questions.
Prediction 2. PMM theory predicts that in the representa-
tive set (city questions) people show zero overconfidence and, at
the same time, underestimation in frequency judgments.
The solid-square curve in Figure 6 shows the relation be-
3 In Figure 6, we have represented the confidence category (91%-99%) by 95%, and similarly with the other categories. This choice can be criticized because numerical judgments of confidence often cluster around specific values in an interval. (If there is a difference, however, we may expect that it affects the three curves in a similar way, without altering the differences between curves.) In Experiment 2, we used precise values instead of these intervals.
Table 2
Mean Differences Between Estimated and True Frequencies of Correct Answers

                     No information-  Incentive  Information  Information
Set                  no incentive     only       only         and incentive
Representative
  1-50                   -9.94          -9.42       -8.80        -8.74
  51-100                 -9.50         -10.37      -11.95       -11.25
  101-150                -9.88         -10.89      -10.85        -9.90
  151-200                -6.67          -6.70       -9.35        -5.90
  201-250                -9.79          -9.84       -7.95        -5.25
  251-300                -9.47         -10.84       -9.40        -9.05
  Average                -9.21          -9.68       -9.72        -8.35
Selected                 -1.75          -0.60       -2.65         0.30

Note. Negative signs denote underestimation of true number of correct answers.
tween confidence and percentage correct in the representative set—that is, the city questions. For instance, percentage correct in the 100%-confidence category was 90.8%, instead of 72.4%. Overconfidence disappeared (-0.9%). Percentage correct and mean confidence were 71.7 and 70.8, respectively.
The confidence curve for the representative set is similar to aregression curve for the estimation of relative frequencies byconfidence, resulting in underconfidence in the left part of the
Figure 7. Confidence-frequency effect in the representative and selected set. (Solid lines show results of Experiment 1, dotted lines show those of Experiment 2. Frequency judgments are long-run frequencies, N = 50.)
confidence scale, overconfidence in the right, and zero overconfidence on the average.
Table 2 shows the differences between estimated and true frequencies for each block of 50 items and each of the conditions, respectively. Again, each entry is averaged over the 20 subjects in each condition. For instance, subjects who were given neither information nor incentive underestimated their true number of correct answers by 9.94 (on the average) in the first 50 items of the representative set. Table 2 shows that the values of the mean differences were fairly stable over the six subsets, and, most important, they are, without exception, negative (i.e., underestimation). For each of the 24 cells (representative set), the number of subjects with negative differences (underestimation) was compared with the number of positive differences (overestimation) by sign tests, and all 24 p values were smaller than .01.
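The per-cell sign test can be illustrated as follows (a minimal sketch under the standard null hypothesis, not the authors' code; the counts are hypothetical):

```python
# Sign test sketch: under the null hypothesis that over- and
# underestimation are equally likely, the larger of the two counts
# falls in the upper tail of a Binomial(n, 0.5) distribution.
from math import comb

def sign_test_p(n_negative, n_positive):
    """Two-sided sign test p-value (zero differences excluded beforehand)."""
    n = n_negative + n_positive
    k = max(n_negative, n_positive)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the one-sided tail probability

# Hypothetical cell: 18 of 20 subjects underestimate
p_value = sign_test_p(18, 2)  # well below .01
```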
The following is an illustration at the individual level: Subject 1 estimated 28, 30, 23, 25, 23, and 23, respectively, for the six subsets, compared with 40, 38, 40, 36, 35, and 32 correct solutions, respectively. An analysis of individual judgments confirmed the average results. Among the 80 subjects, 71 underestimated the number of correct answers, whereas only 8 subjects overestimated it (frequency judgments were missing for 1 subject). Incidentally, 7 of these 8 subjects were male. In the selected set, for comparison, 44 subjects underestimated and 35 subjects overestimated the number of correct answers, and 1 subject got it exactly right.
We have attributed the emergence and disappearance of overconfidence to selection versus use of a representative set. One objection to this analysis is that the difference between the open-circle and the solid-square curve in Figure 6 is confounded with a difference in the content of both sets. The selected set includes a broad range of general-knowledge questions, whereas the domain of the representative set (cities) is necessarily more restricted. To check for this possible confound, we determined the item difficulties for each of the 50 general-knowledge questions and selected a subset of 50 city questions that had the same item difficulties. If the difference in Figure 6 is independent of content, but results from the selection process, this "matched" subset of city questions should generate the same calibration curves showing overconfidence as the selected set of general-knowledge questions did. Figure 6
shows that this is the case (open-square curve). Both content
domains produce the same results if questions are selected.
To summarize, in the representative set, overestimation dis-
appears in confidence judgments, and zero-overconfidence
coexists with frequency judgments that show large underesti-
mation. Results confirm Prediction 2. Figure 7 (right side)
shows the magnitude of the confidence-frequency effect found.
No previous theory of confidence can predict the results de-
picted in Figure 7.
Prediction 3. PMM theory predicts that overestimation will disappear if the relative frequencies of correct answers (percentage correct) in each confidence category are compared with the
estimated relative frequencies. Because subjects estimated per-
centage correct for all confidence judgments—that is, includ-
ing both the selected and the representative set—we expect not
only that overestimation will disappear (the prediction from
the selected set) but also that it will turn into underestimation
(the prediction from the representative set).
The solid line in Figure 8 shows the results for Experiment 1:
Estimated relative frequencies are well calibrated and show un-
derestimation in five out of seven confidence categories. Over-
estimation of one's knowledge disappears. The only exception
is the 100%-confidence category. The latter is the confidence
category that contains all solutions by local MMs, and errors in
memory or elementary logical operations may account for the
difference. Figure 8 is a "frequentist" variant of the calibration
curve of Figure 6. Here, true percentage correct is compared
with estimated percentage correct, rather than with confi-
dence. For instance, in the 100%-confidence category, true and
estimated percentage correct were 88.8% and 93.0%, respec-
tively.
Figure 8. Calibration curves for judgments of percentage correct in confidence categories. (Solid lines show results of Experiment 1, dotted lines show those of Experiment 2. Values are averaged across both sets of questions.)
Averaged across experimental conditions, the ratio between
estimated frequency in the long run and confidence value is
fairly constant, around .87, for confidence ratings between 65%
and 95%. It is highest in the extreme categories (see Table 3).
To summarize, subjects explicitly distinguished between
confidence in single answers and the relative frequency of
correct answers associated with a confidence judgment. This
result is implied by PMM theory, according to which different
reference classes are cued by confidence and frequency tasks.
As stated in Prediction 3, overestimation disappeared. How-
ever, the magnitude of underestimation was not, as might be
expected, as pronounced as in the frequency judgments dealt
with in Predictions 1 and 2. Except for this finding, results con-
formed well to Prediction 3. Note that no previous theory of
confidence in knowledge we are aware of makes this conceptual
distinction and that prediction. Our results contradict much of
what has been assumed about how the untutored mind under-
stands the relation between confidence and relative frequency
of correct answers.
Information about overconfidence and monetary incen-
tive. Mean confidence judgments were indistinguishable be-
tween subjects informed about overconfidence and those unin-
formed. None of seven t tests, one for each confidence category,
resulted in p values smaller than .05. If a monetary incentive
was announced, overconfidence was more pronounced with
incentive than without incentive in five categories (65%-100%)
and less in the 50% category (all ps < .05), with an average
increase of 3.6%.
The monetary incentive effect resulted from the incentive/
no-information group, in which confidence judgments were
higher than in all three other groups (but we found the same
percentage correct in all groups). One reason for this interac-
tion could be that we did not specify in the instructions a crite-
rion for best performance. If warned of overconfidence, sub-
jects could easily infer that the incentive was for minimizing
overconfidence. If not warned, at least some subjects could also
have attempted to maximize percentage correct. None of these
attempts, however, was successful, consistent with PMM
theory and earlier studies (e.g., Fischhoff, Slovic, & Lichten-
stein, 1977). The effort to raise the percentage correct seems to
have raised confidence instead, an outcome that cannot be ac-
Table 3
Estimated and True Percentage Correct in Each Confidence Category (Summarized Over the Representative and the Selected Sets)

                      No. of                  % correct         Over-/under-
Confidence category   confidence judgments   Estimated   True   estimation
100                         5,166              93.0      88.8       4.2
91-99                       1,629              82.7      81.6       1.1
81-90                       2,534              73.1      74.6      -1.5
71-80                       2,950              64.3      70.1      -5.8
61-70                       3,506              57.3      65.6      -8.3
51-60                       4,036              53.7      63.3      -9.6
50                          8,178              49.8      56.3      -6.5
Σ or M                     27,999              64.8      69.1      -4.2
counted for by PMM theory. The size of this effect, however, was small compared with both the confidence-frequency effect and that of selected versus representative sampling.
To summarize, neither warning of overconfidence nor the associated monetary incentive decreased overconfidence or increased percentage correct, replicating earlier findings that knowledge about overconfidence is not sufficient to change confidence. An incentive that subjects seem to have interpreted as rewarding those who maximize the percentage correct, however, increased confidence.
Order of presentation and sex. Which set (representative vs. selected) was given first had no effect on confidences, in either Experiment 1 or Experiment 2. Arkes, Christensen, Lai, and Blumer (1987) found an effect of the difficulty of one group of items on the confidence judgments for a second when subjects received feedback on their performance in the first set. In our experiment, however, no feedback was given. Thus, subjects had no reason to correct their confidence judgments, such as by subtracting a constant value. Sex differences in degree of overconfidence in knowledge have been claimed by both philosophy and folklore. Our study, however, showed no significant differences between the sexes in either overconfidence or calibration, in either Experiment 1 or Experiment 2. (The men's confidence judgments were on average 5% higher than women's, but so was their percentage correct. This replicates Lichtenstein and Fischhoff's, 1981, findings about students at the University of Oregon.)
To summarize, as predicted by PMM theory, we can experimentally make overconfidence (overestimation) appear, disappear, and invert. Experiment 1 made our subjects consistently switch back and forth among these responses. The key to this finding is a pair of concepts that have been neglected by the main previous explanations of confidence in one's knowledge: confidence versus frequency judgment and representative versus selected sampling.
Experiment 2
We tried to replicate the facts and test several objections. First, to strengthen the case against this theory, we instructed the subjects both verbally and in written form that confidence is subjective probability, and that among all cases where a subjective probability of X% was chosen, X% of the answers should be correct. Several authors have argued that such a frequentist instruction could enhance external calibration or internal consistency (e.g., Kahneman & Tversky, 1982; May, 1987). According to PMM theory, however, confidence is already inferred from frequency (with or without this instruction), but from frequencies of co-occurrences between, say, number of inhabitants and several cues, and not from base rates of correct answers in similar testing situations (see Table 1). Thus, in our view, the preceding caution will be ineffective because the base rate of correct answers is not a probability cue that is defined on a reference class such as cities in Germany.
Second, consider the confidence-frequency effect. We have shown that this new effect is implied by PMM theory. One objection might be that the difference between confidence and frequency judgments is an artifact of the response function, just as overconfidence has sometimes been thought to be. Consider the following interpretation of overconfidence. If (a) confidence is well calibrated but (b) the response function that transforms confidence into a confidence judgment differs from an identity function, then (c) overconfidence or underconfidence "occurs" on the response scale. Because an identity function has not been proven, Anderson (1986), for instance, denoted the overconfidence effect and the hard-easy effect as "largely meaningless" (p. 91): They might just as well be response-function artifacts.
A similar objection could be made against the interpretation of the confidence-frequency effect within PMM theory. Despite the effect's stability across selected and representative sets, it may just reflect a systematic difference between response functions for confidence and frequency judgments. This conjecture can be rephrased as follows: If (a) the difference between "internal" confidence and frequency impression is zero, but (b) the response functions that transform both into judgments differ systematically, then (c) a confidence-frequency effect occurs on the response scales. We call this the response-function conjecture.
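The logic of such response-function accounts can be made concrete with a small simulation (our illustration, not from the paper; the particular response function is hypothetical): if internal confidence is perfectly calibrated but reports pass through a non-identity response function, apparent overconfidence appears on the response scale.

```python
# Simulated subject: internal confidence p is drawn at random, and the
# answer is correct with exactly probability p (perfect calibration).
import random

random.seed(1)
internal = [random.uniform(0.5, 1.0) for _ in range(100_000)]
correct = [1 if random.random() < p else 0 for p in internal]

def response(p):
    # hypothetical non-identity response function that inflates reports
    return p + 0.1 * (1.0 - p)

reported = [response(p) for p in internal]
# measured "overconfidence": mean reported confidence minus percent correct
overconfidence = sum(reported) / len(reported) - sum(correct) / len(correct)
# positive, although internal confidence was calibrated by construction
```

The measured overconfidence here is a few percentage points, contributed entirely by the response function; the conjecture attributes the confidence-frequency effect to the same kind of artifact.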
How can this conjecture be tested? According to PMM theory, the essential basis on which both confidence and frequency judgments are formed is the probability cues, not response functions. We assumed earlier that frequency judgments are based mainly on base rates of correct answers in a reference class of similar general-knowledge test situations. If we make another cue available, then frequency judgments should change. In particular, if we make the confidence judgments more easily retrievable from memory, these can be used as additional probability cues, and the confidence-frequency effect should decrease. This was done in Experiment 2 by introducing frequency judgments in the short run, that is, frequency judgments for a very small number of questions. Here, confidence judgments can be more easily retrieved from memory than they could in the long run. Thus, if PMM theory is correct, the confidence-frequency effect should decrease in the short run. If the issue were, however, different response functions, then the availability of confidence judgments should not matter because confidence and frequency impression are assumed to be identical in the first place. Thus, if the conjecture is correct, the confidence-frequency effect should be stable.
In Experiment 2, we varied the length N of a series of questions from the long-run condition N = 50 in Experiment 1 to the smallest possible short run of N = 2.
Third, in Experiment 1 we used a response scale ranging from 50% to 100% for confidence judgments but a full-range response scale for frequency judgments ranging from 0 to 50 correct answers (which corresponds to 0% to 100%). Therefore one could argue that the confidence-frequency effect is an artifact of the different ranges of the two response scales. Assume that (a) there is no difference between internal confidence and frequency, but (b) because confidence judgments are limited to the upper half of the response scale, whereas frequency judgments are not, (c) the confidence-frequency effect results as an artifact of the half-range response scale in confidence judgments. We refer to this as the response-range conjecture. It can be backed up by at least two hypotheses.
1. Assume that PMM theory is wrong and subjects indeed use base rates of correct answers as a probability cue for confidence in single answers. Then confidence should be considerably lower. If subjects anticipate misleading questions, even confidences lower than 50% are reasonable to expect on this conjecture. Confidences below 50%, however, cannot be expressed on a scale with a lower boundary at 50%, whereas they can on the frequency scale. Effects of response range such as those postulated in range-frequency theory (Parducci, 1965) or by Schönemann (1983) may enforce the distorting effect of the half-range format. In this account, both the overconfidence effect and the confidence-frequency effect are generated by a response-scale effect. With respect to overconfidence, this conjecture has been made and has claimed some support (e.g., May, 1986, 1987; Ronis & Yates, 1987). We call this the base rate hypothesis.
2. Assume that PMM theory is wrong in postulating that
choice and confidence are essentially one process and that the
true process is a temporal sequence: choice, followed by search
for evidence, followed by confidence judgment. Koriat et al.
(1980), for instance, proposed this sequence. Assume further,
contrary to Koriat, that the mind is "Popperian," searching for
disconfirming rather than for confirming evidence to deter-
mine the degree of "corroboration" of an answer. If the subject
is successful in retrieving disconfirming evidence from mem-
ory, but is not allowed to change the original answer, confidence
judgments less than 50% will result. Such disconfirmation strat-
egies, however, can hardly be detected using a 50%-100% for-
mat, whereas they could in a full-scale format. We call this the
disconfirmation strategy hypothesis.
To test the response-range conjecture, half of the subjects in
Experiment 2 were given full-range response scales, whereas
the other half received the response scales used in Experi-
ment 1.
Method
Subjects. Ninety-seven new subjects at the University of Konstanz
(not enrolled in psychology) were paid for participation. There were 59 male and 38 female subjects. As in Experiment 1, subjects were tested in small groups of no more than 7 subjects.
Design and procedure. This was a 4 × 2 × 2 design, with length of series (50, 10, 5, and 2) and response scale (half range vs. full range) varied between subjects and type of knowledge questions (selected vs. representative set) varied within subjects.
The procedure and the materials were like those in Experiment 1, except for the following. We used a new random sample of 21 (instead of 25) cities. This change decreased the number of questions in the representative set from 300 to 210. As mentioned earlier, we explicitly instructed the subjects to interpret confidences as frequencies of correct answers: "We are interested in how well you can estimate subjective probabilities. This means, among all the answers where you give a subjective probability of X%, there should be X% of the answers correct." This calibration instruction was orally repeated and emphasized to the subjects.
The response scale contained the means (50%, 55%, 65%, ..., 95%, 100%) of the intervals used in Experiment 1 rather than the intervals themselves to avoid the problematic assumption that means would represent intervals. Endpoints were marked absolutely certain that the alternative chosen is correct (100%), both alternatives equally probable (50%), and, for the full-range scale, absolutely certain that the alternative chosen is incorrect (0%). In the full-range scale, one reason for using confidences between 0% and 45% was explained in the following illustration: "If you think after you have made your choice that you would have better chosen the other alternative, do not change your choice, but answer with a probability smaller than 50%."
After each set of N = 50 (10, 5, or 2) answers, subjects gave a judgment of the number of correct answers. After having completed 50 + 210 = 260 confidence judgments and 5, 26, 52, or 130 frequency judgments (depending on the subject's group), subjects in both response-scale conditions were presented the same enlarged copy of the 50%-100% response scale and asked to estimate the relative frequency of correct answers in each confidence category.
Results
Response-range conjecture. We tested the conjecture that the
systematic difference in confidence and frequency judgments
stated in Predictions 1 and 2 (confidence-frequency effect) and
shown in Experiment 1 resulted from the availability of only a
limited response scale for confidence judgments (50% to 100%).
Forty-seven subjects were given the full-range response scale
for confidence judgments. Twenty-two of these never chose
confidences below 50%; the others did. The number of confi-
dence judgments below 50% was small. Eleven subjects used
them only once (in altogether 260 judgments), 5 did so twice, and
the others 3 to 7 times. There was one outlier, a subject who
used them 67 times. In total, subjects gave a confidence judg-
ment smaller than 50% for only 1.1% of their answers (excluding
the outlier: 0.6%). If the response-range conjecture had been
correct, subjects would have used confidence judgments below
50% much more frequently.
In the representative set, overconfidence was 3.7% (SEM = 1.23) in the full-range scale condition and 1.8% (SEM = 1.15) in the half-range condition. In the selected set, the corresponding values were 14.4 (SEM = 1.54) and 16.4 (SEM = 1.43). Averaging
all questions, we got slightly larger overconfidence in the full-
range condition (mean difference = 1.2). The response-range
conjecture, however, predicted a strong effect in the opposite
direction. Frequency judgments were essentially the same in
both conditions. Hence, the confidence-frequency effect can
also be demonstrated when both confidence and frequency
judgments are made on a full-range response scale.
To summarize, there was (a) little use of confidences below
50% and (b) no decrease of overconfidence in the full-range
condition. These results contradict the response-range conjec-
ture.
A study by Ronis and Yates (1987) seems to be the only other
study that has compared the full-range and the half-range for-
mat in two-alternative choice tasks, but it did not deal with
frequency judgments. These authors also reported that only
about half their subjects used confidence judgments below 50%,
although they did so more frequently than our subjects. Ronis
and Yates concluded that confidences below 50% had only a
negligible effect on overconfidence and calibration (pp. 209-
211). Thus, results in both studies are consistent. The main
difference is that Ronis and Yates seem to consider only "failure
to follow the instructions" and "misusing the probability scale"
(p. 207) as possible explanations for confidence judgments be-
low 50%. In contrast, we argue that there are indeed plausible
cognitive mechanisms—the base rate and disconfirmation strategy hypotheses—that imply these kinds of judgments, although they would contradict PMM theory.
PROBABILISTIC MENTAL MODELS 519
Both Experiment 2 and the Ronis and Yates (1987) study do
not rule out, however, a more fundamental conjecture that is
difficult to test. This argument is that internal confidence (not
frequency) takes a verbal rather than a numerical form and that
it is distorted on any numerical probability rating scale, not just
on a 50%-100% response scale. Zimmer (1983, 1986) argued
that verbal expressions of uncertainty (such as "highly improbable" and "very likely") are more realistic, more precise, and less prone to overconfidence and other so-called judgmental biases than are numerical judgments of probability. Zimmer's
fuzzy-set modeling of verbal expressions, like models of proba-
bilistic reasoning that dispense with the Kolmogoroff axioms
(e.g., Cohen, 1989; Kyburg, 1983; Shafer, 1978), remains a largely
unexplored source of alternative accounts of confidence.
For the remaining analysis, we do not distinguish between
the full-range and the half-range response format. For combin-
ing the data, we recoded answers like "alternative a, 40% confident" as "alternative b, 60% confident," following Ronis and
Yates (1987).
Predictions 1 and 2: Confidence-frequency effect. The question is whether the confidence-frequency effect can be replicated under the explicit instruction that subjective probabilities
should be calibrated to frequencies of correct answers in the
long run. Calibration curves in Experiment 2 were similar to
those in Figure 6 and are not shown here for this reason. Figure
7 shows that the confidence-frequency effect replicates. In the
selected set, mean confidence was 71.6%, and percentage
correct was 56.2. Mean estimated number of correct answers
(transformed into percentages) in the series of N = 50 was 52.0%.
As stated in Prediction 1, overconfidence in single answers
coexists with fairly accurate frequency judgments, which once
again show slight underestimation.
In the representative set, mean confidence was 78.1% and
percentage correct was 75.3%.4 Mean estimated number of
correct answers per 50 answers was 63.5%. As forecasted in
Prediction 2, overconfidence largely disappeared (2.8%), and
frequency judgments showed underestimation (-11.8%).
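The two effects reported here are simple differences between the reported means. The following sketch spells out that arithmetic; the Experiment 2 means are hard-coded from the text, and the helper names are our own:

```python
# Means reported for Experiment 2 (all values in percentage points).
selected = {"confidence": 71.6, "correct": 56.2, "freq_judgment": 52.0}
representative = {"confidence": 78.1, "correct": 75.3, "freq_judgment": 63.5}

def overconfidence(s):
    # Overconfidence = mean confidence minus percentage correct.
    return s["confidence"] - s["correct"]

def confidence_frequency_effect(s):
    # Difference between mean confidence and mean frequency judgment.
    return s["confidence"] - s["freq_judgment"]

def frequency_misestimation(s):
    # Negative values indicate underestimation of one's own frequency correct.
    return s["freq_judgment"] - s["correct"]

print(overconfidence(representative))          # about 2.8
print(frequency_misestimation(representative)) # about -11.8
```

For the selected set, the same helpers give an overconfidence of 15.4 points alongside a slightly underestimating frequency judgment, the coexistence stated in Prediction 1.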
An individual analysis produced similar results. The confi-
dence-frequency effect (average confidence higher than average
frequency judgment) held for 82 (83) subjects in the selected
(representative) set (out of 97). Answering the selected set, 92
respondents showed overconfidence, and 5 showed underconfi-
dence. In the representative set, however, 60 exhibited overcon-
fidence and 37, underconfidence.
Prediction 3: Estimated percentage correct in confidence cate-
gories. After the subjects answered the 260 general-knowledge
questions, they were asked what percentage they thought they
had correct in each confidence category. As shown by the
dashed line in Figure 8, results replicated well. Average esti-
mated percentage correct differed again from confidence and
was close to the actual percentage correct.
Despite the instruction not to do so, our subjects still distin-
guished between a specific confidence value and the corre-
sponding percentage of correct responses. Therefore confidence
and hypothesized percentage correct should not be used as syn-
onyms (e.g., Dawes, 1980, pp. 331-345). As suggested by this
experiment, an instruction alone cannot override the cognitive
processes at work.
In the 100%-confidence category, for instance, 67 subjects
gave estimates below 100%. In a postexperimental interview, we
pointed out to them that these judgments imply that they as-
sumed they had not followed the calibration instruction. Most
subjects explained that in each single case, they were in fact
100% confident. But they also knew that, in the long run, some
answers would nonetheless be wrong, and they did not know
which ones. Thus, they did not know which of the 100% answers
they should correct. When asked how they made the confi-
dence judgments, most subjects answered by giving examples
of probability cues, such as "I know that this city is located in
the Ruhrgebiet, and most cities there are rather large." Interviews provided evidence for several probability cues, but no
evidence that base rate expectations, as reported in frequency
judgments, were also used in confidence judgments.
Response-function conjecture: frequency judgments in the
short and long runs. We tested the conjecture that the confi-
dence-frequency effect stated in Predictions 1 and 2 and shown
in Experiment 1 might be due to different response functions
for confidence and frequency judgments, rather than to different cognitive processes as postulated by PMM theory. If the
conjecture were true, the availability of confidence judgments
in the short run should not change the confidence-frequency
effect (see the previous discussion).
Contrary to the response-function conjecture, the length of
series showed a significant effect on the judgments of frequency
of correct answers in each series (p = .025) as well as on the
difference between judged and true frequency (p = .012). Fig-
ure 9 shows the extent of the disappearance of the confidence-
frequency effect in the short run. The curve shows that the
effect decreased from N = 50 to N = 2, averaged across both sets
of items. The decrease was around 12%, an amount similar in
the selected set (from 18.9% to 6.9%) and in the representative
set (from 15.7% to 3.3%).
As would be expected from both the response-function con-
jecture and PMM theory, an analysis of variance over all 260
questions showed no significant effect of length of series (short
vs. long runs) on either confidence judgment (p = .39) or num-
ber of correct answers (p = .40). (Similar results were obtained
when the selected and representative sets were tested separately.)
The breakdown of the confidence-frequency effect in the
short run is inconsistent with the objection that the effect can
be reduced to a systematic difference in response functions.
This result is, however, consistent with the notion that the
shorter the run, the more easily are confidence judgments avail-
able from memory, and, thus, the more they can be used as
probability cues for the true number of correct answers.
Discussion
Our starting point was the overconfidence effect, reported in
the literature as a fairly stable cognitive illusion in evaluating
one's general knowledge and attributed to general principles of
memory search, such as confirmation bias (Koriat et al., 1980),
4 Confidence and percentage correct are averaged across all four conditions (series length) because these do not differ systematically among conditions. For comparison, the corresponding values for confidence and percentage correct in the N = 50 condition are 71.0 and 56.8 in the selected set and 79.2 and 74.5 in the representative set.
520 G. GIGERENZER, U. HOFFRAGE, AND H. KLEINBOLTING
Figure 9. Decrease of the confidence-frequency effect in short runs (N = 50, 10, 5, and 2). (Values are differences between mean confidence and estimated percentage correct in a series of length N. Values are averaged across all questions.)
to general motivational tendencies such as fear of invalidity (Mayseless & Kruglanski, 1987), to insensitivity to task difficulty (see von Winterfeldt & Edwards, 1986, p. 128), and to wishful thinking and other "deficits" in cognition, motivation, and personality. Our view, in contrast, proposes that one evaluates one's knowledge by probabilistic mental models. In our account, the main deficit of most cognitive and motivational explanations is that they neglect the structure of the task and its relation to the structure of a corresponding environment known to the subjects. If people want to search for confirming evidence or to believe that their answers are more correct than they are because of some need, wish, or fear, then overestimation of accuracy should express itself independently of whether they judge single answers or frequencies, a selected or representative sample of questions, and hard or easy questions.
Our experiments also do not support the explanation of overconfidence and the hard-easy effect by assuming that subjects are insensitive to task difficulty: In frequency tasks we have shown that subjects' judgments of their percentage correct in the long run are in fact close to actual percentage correct, although confidences are not. Overconfidence does not imply that subjects are not aware of task difficulty. At least two more studies have shown that estimated percentage correct can correspond closely to true percentage correct in general-knowledge tasks. Allwood and Montgomery (1987) asked their subjects to estimate how difficult each of 80 questions was for their peers and found that difficulty ratings (M = 57%) were more realistic (percentage correct = 61%) than confidence judgments (M = 74%). May (1987) asked her subjects to estimate their percentage of correct answers after they completed an experiment with two-alternative questions. She found that judgments of percentage correct accorded better with the true percentage correct than did confidences.
On our account, overconfidence results from one of two causes, or both: (a) A PMM for a task is not properly adapted to a corresponding environment (e.g., cue validities do not correspond to ecological validities), or (b) the set of objects used is not a representative sample from the corresponding reference class in the environment but is selected for difficulty. If a is the true cause, using a representative sample from a known environment should not eliminate overconfidence. If b is true, it should. In both experiments, overconfidence in knowledge about city populations was eliminated, as implied by b. Thus, experimental results are consistent with both PMM theory and the assumption that individual PMMs are on the average well adapted to the city environment we used.5 Overconfidence resulted in a set of questions that was selected for difficulty. Underconfidence, conversely, would result from questions selected to be easy.
The foregoing comments do not mean that overestimation of knowledge is just an artifact of selected questions. If it were, then judgments of frequency of correct answers should show a similar degree of overestimation. What we have called the confidence-frequency effect shows that this is not the case.
Several authors have proposed that judgments in the frequency mode are more accurate, realistic, or internally consistent than probabilities for single events (e.g., Teigen, 1974, p. 62; Tversky & Kahneman, 1983). Our account is different. PMM theory states conditions under which mean judgments of confidence are systematically larger than judgments of relative frequency. PMM theory does not, however, imply that frequency judgments are generally better calibrated. On the contrary, frequency judgments may be miscalibrated for the same reasons as confidence judgments. The set of tasks may not be representative for the reference class from which the inferences are made.
The experimental control of overestimation—how to make overestimation appear, disappear, and invert—gives support to PMM theory. These predictions, however, do not exhaust the inferences that can be derived from PMM theory.
Explaining Anomalies in the Literature
In this section, we explain a series of apparently inconsistent findings and integrate these into PMM theory.
Ronis and Yates (1987). We have mentioned that the Ronis and Yates (1987) study is the only other study that tested a full-range response scale for two-alternative tasks. The second purpose of that study was to compare confidence judgments in situations where the subject knows that the answers are known to the experimenter (general-knowledge questions) with outcomes of upcoming basketball games, where answers are not yet known. In all three (response-scale) groups, percentage correct was larger for general-knowledge questions than for basketball predictions. Given this result, what would current theories predict about overconfidence? The insensitivity hypothesis proposes that people are largely insensitive to percentage correct (see von Winterfeldt & Edwards, 1986, p. 128). This implies that overconfidence will be larger in the more difficult (hard) set: the hard-easy effect. (The confirmation bias and motivational explanations are largely mute on the difficulty issue.) PMM theory, in contrast, predicts that overconfidence will be larger in the easier set (hard-easy effect reversal, see Prediction 5) because general-knowledge questions (the easy set) were
5 After finishing this article, we learned about a study by Juslin (1991), in which random samples were drawn from several natural environments. Overall, overconfidence in general knowledge was close to zero, consistent with this study.
selected and basketball predictions were not; only with clairvoyance could one select these predictions for percentage correct.
In fact, Ronis and Yates (1987) reported an apparent anomaly: three hard-easy effect reversals. In all groups, overconfidence was larger for the easy general-knowledge questions than for the hard basketball predictions (Figure 10). Ronis and Yates seem not to have found an explanation for these reversals of the hard-easy effect.
Koriat et al. (1980). Experiment 2 of Koriat et al.'s (1980) study provided a direct test of the confirmation bias explanation of overconfidence. The explanation is this: (a) Subjects first choose an answer based on their knowledge, then (b) they selectively search memory for confirming evidence (or for evidence disconfirming the alternative not chosen), and (c) this confirming evidence generates overconfidence. Between the subjects' choice of an answer and their confidence judgment, the authors asked the subjects to give reasons for the alternative chosen. Three groups of subjects were asked to write down one confirming reason, one disconfirming reason, or one of each, respectively. Reasons were given for half of the general-knowledge questions; otherwise, no reasons were given (control condition). If the confirmation bias explanation is correct, then asking for a contradicting reason (or both reasons) should decrease overconfidence and improve calibration. Asking for a confirming reason, however, should make no difference "since those instructions roughly simulate what people normally do" (Koriat et al., 1980, p. 111).
What does PMM theory predict? According to PMM theory, choice and confidence are inferred from the same activated cue. This cue is by definition a confirming reason. Therefore, the confirming-reason and the no-reason (control) tasks engage the same cognitive processes. The difference is only that in the former the supporting reason is written down. Similarly, the disconfirming-reason and both-reason tasks involve the same
Figure 10. Reversal of the hard-easy effect in Ronis and Yates (1987) and Keren (1988). (The figure plots over/underconfidence against percentage correct; the Ronis & Yates (1987) conditions shown are no-choice-150, choice-50, and choice-100.)
cognitive processes. Furthermore, PMM theory implies that there is no difference between the two pairs of tasks.
This result is shown in Table 4. In the first row we have the no-reason and confirming-reason tasks, which are equivalent. Here, only one cue is activated, which is confirming. There is no disconfirming cue. Now consider the second row, the disconfirming-reason and both-reason tasks, which are again equivalent. Both tasks are solved if one additional cue, which is disconfirming, can be activated. Thus, for PMM theory, the cue generation and testing cycle is started again, and cues are generated according to the hierarchy of cue validities and tested whether they can be activated for the problem at hand. The point is that the next cue that can be activated may turn out to be either confirming or disconfirming.
For simplicity, assume that the probability that the next activated cue turns out to be confirming or disconfirming is the same. If it is disconfirming, the cycle is stopped, and two cues in total have been activated, one confirming and one disconfirming. This stopping happens with probability .5, and it decreases both confidence and overconfidence. (Because the second cue activated has a smaller cue validity, however, confidence is not decreased below 50%.) If the second cue activated is again confirming, a third has to be activated, and the cue generation and testing cycle is entered again. If the third cue is disconfirming, the cycle stops with two confirming cues and one disconfirming cue activated, as shown in the third row of Table 4. This stopping is to be expected with probability .25. Because the second cue has higher cue validity than the third, disconfirming, cue, overall an increase in confidence and overconfidence is to be expected. If the third cue is again confirming, the same procedure is repeated. Here and in all subsequent cases confidence will increase. As shown in Table 4, the probabilities of an increase sum up to .5 (.25 + .125 + .125), which is the same as the probability of a decrease.
Thus, PMM theory leads to the prediction that, overall, asking for a disconfirming reason will not change confidence or overconfidence. As just shown, the confirmation-bias hypothesis, in contrast, predicts that asking for a disconfirming reason should decrease confidence and overconfidence.
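The cue generation-and-testing argument can be checked by a small simulation; the sketch below assumes, as in the text, a fixed .5 chance that each newly activated cue after the first turns out to be disconfirming:

```python
import random

def cues_until_disconfirming(p_dis=0.5, max_cues=50):
    """Return the total number of cues activated when the cycle stops.
    The first cue is confirming by definition; each further cue is
    disconfirming with probability p_dis, which stops the cycle."""
    n = 1
    while n < max_cues:
        n += 1
        if random.random() < p_dis:
            break
    return n

random.seed(42)
stops = [cues_until_disconfirming() for _ in range(100_000)]

# Stopping after exactly 2, 3, or 4 cues should occur with probability
# about .5, .25, and .125, matching the rows of Table 4.
probs = {k: sum(s == k for s in stops) / len(stops) for k in (2, 3, 4)}

# Confidence decreases only when the cycle stops at 2 cues; it increases
# whenever 3 or more cues are activated, so the two probabilities balance.
p_decrease = sum(s == 2 for s in stops) / len(stops)
p_increase = sum(s >= 3 for s in stops) / len(stops)
```

Because the stopping rule is geometric, the probability of a decrease and the summed probability of an increase are both .5, which is the balance that yields the prediction of no net change.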
What were the results of the Koriat study? In both crucial conditions, disconfirming reason and both reasons, the authors found only small and nonsignificant decreases of overconfidence (2% and 1%, respectively) and similarly small improvements in calibration (.006 each, significant only in the disconfirming-reason task). These largely nonsignificant differences are consistent with the prediction by PMM theory that asking for a disconfirming reason makes no difference and are inconsistent with the confirmation-bias explanation. Further evidence comes from a replication of the Koriat study by Fischhoff and MacGregor (1982), who reported zero effects of disconfirming reasons.
To summarize, the effects on confidence of giving confirming and disconfirming reasons in the Koriat study can be both explained by and integrated into PMM theory. There is no need to postulate a confirmation bias.
Dawes (1980). Overconfidence has been attributed to people's tendency to "overestimate the power of our 'intellect' as opposed to that of our coding abilities." Such overestimation "has been reinforced by our realization that we have developed
Table 4
Predictions of PMM Theory for the Effects of Asking for a Disconfirming Reason

                  No. of cues    Cues activated                    Predicted change
Task              activated      CON      DIS      Probability     in confidence
No; CON           1              1        0        --              --
DIS; both         2              1        1        .5              Decrease
DIS; both         3              2        1        .25             Increase
DIS; both         4              3        1        .125            Increase
DIS; both         >4             >3       1        .125            Increase

Note. CON = confirming reason; DIS = disconfirming reason; No = no reason; both = both reasons.
a technology capable of destroying ourselves" (Dawes, 1980, p. 328). Dawes (1980) proposed that overconfidence is characteristic of general-knowledge questions but absent in perceptual tasks; he designed a series of experiments to test this proposal. PMM theory, however, gives no special treatment to perceptual tasks. On the contrary, it predicts overconfidence if perceptual tasks are selected for perceptual illusions—that is, for being misleading—whereas zero overconfidence is to be expected if tasks are not selected. Pictures in textbooks on visual illusions are probably the set of items that produces the most extreme overconfidence yet demonstrated. Nevertheless, in a natural environment, perception is generally reliable.
Dawes reported inconsistent results. When perceptual stimuli were systematically constructed from a Square × Circle matrix, as in the area task, and no selection for stimuli that generated perceptual illusions took place, overconfidence was close to zero (perception of areas of squares is quite well adapted in adults; see Gigerenzer & Richter, 1990). This result is predicted by both accounts. The anomaly arises with the second perceptual task used—judging which of two subsequent tones is longer.6 If the second tone was longer, Dawes reported almost perfect calibration, but if the first tone was longer, subjects exhibited large overconfidence.
PMM theory predicts that in the inconsistent acoustic task, perceptual stimuli have been selected (albeit unwittingly) for a perceptual illusion. This is in fact the case. From the literature on time perception, we know that of two subsequently presented tones, the tone more recently heard appears to be longer. This perceptual illusion is known as the negative presentation effect (e.g., Fraisse, 1964; Sivyer & Finlay, 1982). It implies a smaller percentage of correct answers in the condition where the tone presented first was longer, because this tone is perceived to be shorter. A decrease in percentage correct in turn increases overconfidence. In Dawes's (1980) experiments, this is exactly the inconsistent condition where overconfidence occurred. Thus, from the perspective we propose, this inconsistent result can be reconciled.
Keren (1988). A strict distinction between perceptual judgment and intellectual judgment cannot be derived from many views of perception, such as signal-detection theory (Tanner & Swets, 1954). Reflecting on this fact, Keren (1988) proposed a slightly modified hypothesis: The more perceptionlike a task is, the less overconfident and the better calibrated subjects will be. "As a task requires additional higher processing and transformation of the original sensory input, different kinds of possible cognitive distortions may exist (such as inappropriate inferences) that may limit the ability to accurately monitor our higher cognitive processes" (Keren, 1988, p. 99).
Keren (1988, Experiment 1) used general-knowledge questions and two kinds of perceptual tasks, one of them more difficult than the general-knowledge task, the other less difficult. Keren tested the hypothesis that confidence judgments in perceptual tasks are better calibrated than in general-knowledge tasks. He could not support it, however. Instead, he found an anomaly: The comparison between the general-knowledge task and the more difficult perceptual task reversed the hard-easy effect (see Figure 10). As derived in Prediction 5, this puzzling reversal is implied by PMM theory if the Landolt rings used in the more difficult perceptual task were not selected for perceptual illusions, as seems to be the case (Keren, 1988, p. 100).
Note that the kind of general-knowledge questions used (population of cities or countries, and distances between cities) would easily permit defining a reference class in a known environment and obtaining representative samples. But no representative sample of general-knowledge questions was generated. This lack makes the other predictions from PMM theory coincide with Keren's (1988): overconfidence in general knowledge, and zero overconfidence in the two perceptual tasks. Results show this outcome, except for the large-gap Landolt rings condition, which generated considerable underconfidence. PMM theory cannot account for the latter, nor can the notion of degree of perception-likeness.
A second perceptual task was letter identification. In Experiment 3, Keren (1988) used two letter-identification tasks, which were identical except that the exposure time of the letters to be recognized was either short or long. Mean percentages correct were 63.5 for short and 77.2 for long exposures. According to earlier explanations such as subjects' insensitivity to task difficulty, a hard-easy effect should result. According to PMM
6 Dawes's eye-color task is not dealt with here because it is a memory task, not a perceptual task.
theory, however, the hard-easy effect should be zero, because both tasks were generated by the same sampling process (Prediction 4). In fact, Keren (1988, p. 112) reported that in both tasks, overconfidence was not significantly different from zero. He seems to have found no explanation for this disappearance of the hard-easy effect in a situation where differences in percentage correct were large.7
Mental Models as Probabilistic Syllogisms
To the best of our knowledge, there is only one other mental models approach to confidence. May (1986, 1987) emphasized the role of mental models to understand the mechanism of confidence judgments and the role of misleading questions to cause overconfidence. Consider again the following question: "Which city has more inhabitants? (a) Hyderabad or (b) Islamabad." May proposed that subjects answer this question by constructing a mental model that can be expressed as a "probabilistic syllogistic inference" (May, 1986, p. 21):
"Most capitals have quite a lot of inhabitants.
Islamabad is a capital.
Presumably, Islamabad has quite a lot of inhabitants."
Replacing "most" by the probability p(large | capital) = 1 - alpha, that is, the probability that a city has a large population if it is a capital, she made the following argument:
"The subjective probability therefore has to be '1 - alpha.' Given that the perception of that percentage is correct, the observed frequency of correct answers will be 1 - alpha. Even if there was random fluctuation of subjective probability and task difficulty, in the long run calibration is expected" (May, 1986, p. 20).
May (1986, 1987) did highly interesting analyses of individual general-knowledge questions—among others, direct tests of the use of the capital cue and the familiarity cue in questions of the Hyderabad-Islamabad type. Both May and PMM theory share an emphasis on the mental models subjects use to construe their tasks. We discuss here two issues where we believe that May's position could be strengthened: the kind of mental model she proposed and the role of misleading questions.
First, we show that the probabilistic syllogism does not work in the sense she specified, that is, generating long-run calibration. We propose a working version. The general reason why the syllogism mental model does not work in a two-alternative task is that it does not deal with information about the alternative, that is, whether Hyderabad is known as a capital or noncapital or whether its status is unknown. Specifically, assume that the subject knows nothing about Hyderabad, as May did (1986, p. 19). The syllogism produces the choice a and the confidence judgment p(a = large | a = capital), where a stands for Islamabad. However, this confidence judgment is not equal to the long-run frequency of correct answers. The long-run frequency of correct answers p(a larger than b | a = capital) depends on both p(a = large | a = capital) and p(b = large | b = city), where b stands for Hyderabad and "city" means no knowledge of whether b is a capital or a noncapital. For instance, the larger p(b = large | b = city), the smaller the long-run frequency of correct answers.
A mental model that generates confidence judgments that are well calibrated with long-run frequencies of correct answers can be constructed in at least two ways. First, the probabilistic
syllogism can be supplemented by a second syllogism that uses our knowledge about the alternative—capital, noncapital, or just city (no knowledge of whether it is a capital or noncapital). Here is a numerical illustration of what can be called the double-syllogism model:
80% of capitals have a large population.
Islamabad is a capital.
The probability is .80 that Islamabad has a large population.
40% of cities have a large population.
Hyderabad is a city.
The probability is .40 that Hyderabad has a large population.
What is the long-run frequency of correct answers? There are four possible classes of events: a = large and b = not large; b = large and a = not large; a = large and b = large; and a = not large and b = not large. Given the choice a, the first class of events signifies correct choices; the second, incorrect choices; and the third and fourth do not contain discriminating information. The long-run frequency of correct answers consists of the first class of events and of half of the third and fourth—on the assumption that half of the nondiscriminating cases will be correct and the other half incorrect. Thus, the long-run frequency of correct answers is p(a = large, b = not large | a = capital) + ½(p[a = large, b = large | a = capital] + p[a = not large, b = not large | a = capital]). We denote the probabilities from the first and the second syllogism as α and β, respectively. Then, the long-run frequency of correct answers is α(1 - β) + ½(αβ + [1 - α][1 - β]), which is ½(α - β + 1). For instance, if α = β, this probability is .50. Therefore, in the above double-syllogism model, the (calibrated) confidence that a is correct is ½(α - β + 1). May (1986), in contrast, proposed α. For the double syllogism, we get ½(.80 - .40 + 1) = .70.
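The arithmetic of the double-syllogism model is easy to check numerically. The following short simulation (our sketch, not part of the original derivation; all function names are ours) samples the four classes of events and confirms that the long-run frequency of correct answers equals ½(α - β + 1):

```python
import random

def long_run_accuracy(alpha, beta, trials=200_000, seed=1):
    """Simulate the double-syllogism model: the subject always chooses a
    (Islamabad); a has a large population with probability alpha, and
    b (Hyderabad) with probability beta."""
    rng = random.Random(seed)
    score = 0.0
    for _ in range(trials):
        a_large = rng.random() < alpha
        b_large = rng.random() < beta
        if a_large and not b_large:
            score += 1.0   # discriminating case: the choice a is correct
        elif a_large == b_large:
            score += 0.5   # nondiscriminating case: correct half of the time
        # remaining case (b large, a not large): choice a is incorrect
    return score / trials

def closed_form(alpha, beta):
    """The closed-form result derived in the text: 1/2 (alpha - beta + 1)."""
    return 0.5 * (alpha - beta + 1)

print(long_run_accuracy(0.80, 0.40))  # close to .70
print(closed_form(0.80, 0.40))        # exactly .70
```

With α = .80 and β = .40, both the simulation and the closed form give the .70 reported in the text, whereas May's proposal would give α = .80.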
A second solution would be to dispense with the dichotomyof large versus small population and to use the cue validities asdefined in PMM theory. Both changes would make May's men-tal models work and would lead to a mechanism that differs in
7 Keren (1987, 1988; Wagenaar & Keren, 1985) also distinguished
tasks in which the items are related (e.g., repeated weather forecasting)
versus unrelated (e.g., typical general-knowledge questions). A similar
distinction was made by Ronis and Yates (1987). PMM theory can
connect Keren's distinction between two kinds of tasks with a model of
cognitive processes involved in different tasks. In a set of unrelated
items, a new PMM has to be constructed for each new item that cannot be answered by a local MM. This new PMM includes a new reference
class, new target variable, and new cues and cue validities. This holds
for reasoning about a set of typical general-knowledge questions. In
contrast, the representative set of city questions used in our experi-
ments implies that the PMMs for subsequent items include the same
reference class, same target variable, and same hierarchy of cues and cue
validities but that different cues will be activated in different ques-
tions. Thus, in this framework, the distinction between related and
unrelated items is neither a dichotomy nor a single continuum, but
multidimensional. In general, a set of items can cue a series of PMMs
that have (a) the same-different reference class, (b) the same-different target variable, (c) the same-different set of cues and cue validities, and
(d) the same-different activated cues. Thus, at the other extreme of the
typical general-knowledge task, there is a series of tasks that implies
the construction of a succession of PMMs that are identical with re-
spect to all four dimensions. An example is the repeated judgment of
the frequency of correct answers in our experiments.
524 G. GIGERENZER, U HOFFRAGE, AND H. KLEINBOLTING
an interesting way from that illustrated in Figure 3. In contrast
to what we have proposed, May (1986) assumed that cues are
activated even if knowledge from memory can be retrieved only
for one alternative.
May (1986, 1987) also proposed that overconfidence is due to
misleading items. The Islamabad-Hyderabad question is one
example of a misleading item with less than 50% correct an-
swers. Most subjects chose Islamabad, whereas Hyderabad has
a much larger population. An extreme example is "Three
fourths of the world's cacao comes from (a) Africa or (b) South
America" for which Fischhoffet al. (1977) reported only 4.8%
correct answers. Fischhoffet al. showed that extreme overconfi-
dence, such as odds greater than 50:1, are prevalent in what they
called deceptive items (more than 73%) but still exist in nonde-
ceptive items (less than 9%). Because of the latter finding,
among other grounds, they concluded that misleading items
"are not responsible for the extreme overconfidence effect" (p.
561). In contrast, May (1986, 1987) seems to have held that if
misleading items were eliminated, overconfidence would be,
too. But she also emphasized, as we do, the role of representa-
tive sampling.
We propose that the issue of misleading questions can be
fully reduced to the issue of representative sampling from a
reference class, which provides a deeper understanding of con-
fidence in knowledge. As we have pointed out before, the same
set of general-knowledge questions can be nonrepresentative
with respect to one reference class (e.g., all cities in Germany),
but representative with respect to a different reference class (e.g.,
sets of typical general-knowledge questions). The notion of mis-
leading items does not capture this distinction, which is essential
both to PMM theory and to an explanation for when
and why overconfident subjects can quite realistically estimate
their true relative frequencies of correct responses.
The Brunswikian Perspective
PMM theory draws heavily on the Brunswikian notions of a
natural environment known to an individual, reference classes
in this environment, and representative sampling from a refer-
ence class. We went beyond the Brunswikian focus on achieve-
ment (rather than process) by providing a theoretical frame-
work of the processes that determine choice, confidence, and
frequency judgment.
Choice and confidence are a result of a cue-testing and acti-
vation cycle, which is analogous to Newell and Simon's (1972)
postulate that "problem solving takes place by search in a prob-
lem space" (p. 809). Furthermore, the emphasis on the structure
of the task in PMM theory is similar to Newell and Simon's
proposition that "the structure of the task determines the possi-
ble structures of the problem space" (p. 789). Unlike PMM
theorists, however, Newell and Simon also assumed in the tasks
they studied (cryptarithmetic, logic, and chess) a relatively sim-
ple mapping between the external structure of the task and the
internal representation in a problem space (see Allport, 1975).
Although it is cued by the task structure, we assume that a
PMM (the functional equivalent of a problem space) has a large
surplus structure (the reference class and the cues), which is
taken from a known structure in the problem solver's natural
environment. The emphasis on the structure of everyday knowl-
edge or environment (as distinguished from the task environ-
ment) has been most forcefully defended by Brunswik (see Gi-
gerenzer, 1987). Although Newell and Simon (1972, p. 874)
called Brunswik and Tolman "the real forerunners" of their
work, they seem not to distinguish clearly between the notions
of a probabilistic everyday environment and a task environ-
ment. This theory is an attempt to combine both views. Bruns-
wik's focus on achievement (during his behavioristic phase; see
Leary, 1987) corresponds more closely to the part of research on
probabilistic judgment that focuses on calibration, rather than
on the underlying cognitive processes.
The importance of the cognitive representation of the task
was studied by the Würzburg school and emphasized in Gestalt
theoretical accounts of thinking (e.g., Duncker, 1935/1945), and
this issue has recently regained favor (e.g., Brehmer, 1988; Ham-
mond, Stewart, Brehmer, & Steinmann, 1975). In their review,
Einhorn and Hogarth (1981) emphasized that "the cognitive
approach has been concerned primarily with how tasks are rep-
resented. The issue of why tasks are represented in particular
ways has not yet been addressed" (p. 57). PMM theory ad-
dresses this issue. Different tasks, such as confidence and fre-
quency tasks, cue different reference classes and different proba-
bility cues from known environments. It is these environments
that provide the particular representation, the PMM, of a task.
Many parts of PMM theory need further expansion, develop-
ment, and testing. Open issues include the following: (a) What
reference class is activated? For city comparisons, this question
has a relatively clear answer, but in general, more than one
reference class can be constructed to solve a problem. (b) Are
cues always generated according to their rank in the cue validity
hierarchy? Alternative models of cue generation could relax this
strong assumption, assuming, for instance, that the first cue
generated is the cue activated in the last problem. The latter
would, however, decrease the percentage of correct answers. (c)
What are the conditions under which we may expect PMMs to
be well adapted? There exists a large body of neo-Brunswikian
research that, in general, indicates good adaptation but also
points out exceptions (e.g., K. Armelius, 1979; Brehmer & Joyce,
1988; Björkman, 1987; Hammond & Wascoe, 1980). (d) What
are the conditions under which cue substitution without cue
integration is superior to multiple cue integration? PMM theory
assumes a pure cue substitution model—a cue that cannot be
activated can be replaced by any other cue—without integra-
tion of two or more cues. We focused on the substitution and
not the integration aspect of Brunswik's vicarious functioning
(see Gigerenzer & Murray, 1987, pp. 66-81), in contrast to the
multiple regression metaphor of judgment. Despite its simplic-
ity, the substitution model produces zero overconfidence and a
large number of correct answers, if the PMM is well adapted.
There may be more reasons for simple substitution models.
B. Armelius and Armelius (1974), for instance, reported that
subjects were well able to use ecological validities, but not the
correlations between cues. If the latter is the case, then multiple
cue integration may not work well.
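A minimal sketch of the pure cue substitution model may help to fix ideas (the cue names, validities, and city knowledge below are invented for illustration and are not taken from our experiments): cues are tested in the order of their validities, and the first cue that can be activated for both alternatives and discriminates between them determines both the choice and the confidence.

```python
# Hypothetical cue hierarchy for a city-comparison task. Each cue maps a
# city to True/False/None; None means the cue cannot be activated for
# that city (no knowledge in memory).
CUES = [
    ("capital",     0.80, {"Islamabad": True, "Hyderabad": None}),
    ("soccer team", 0.70, {"Islamabad": None, "Hyderabad": None}),
    ("familiarity", 0.60, {"Islamabad": True, "Hyderabad": False}),
]

def choose(city_a, city_b):
    """Pure substitution: test cues by descending validity; the first cue
    that discriminates gives the choice, and its validity the confidence.
    No integration of two or more cues takes place."""
    for name, validity, values in CUES:
        va, vb = values.get(city_a), values.get(city_b)
        if va is not None and vb is not None and va != vb:
            winner = city_a if va else city_b
            return winner, validity, name
    return city_a, 0.5, None  # no cue discriminates: guess

print(choose("Islamabad", "Hyderabad"))  # -> ('Islamabad', 0.6, 'familiarity')
```

In this invented example the capital cue cannot be activated (Hyderabad's status is unknown), so the familiarity cue is substituted for it, in line with the substitution (rather than integration) reading of vicarious functioning.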
We briefly indicate here that features emphasized by PMM
theory, such as representative sampling and the confidence-fre-
quency distinction, can also be crucial for probabilistic reason-
ing in other tasks.
In several Bayesian-type studies of revision of belief, repre-
sentative (random) sampling from a reference class is a crucial
issue. For instance, Gigerenzer, Hell, and Blank (1988) showed
that subjects' neglect of base rates in Kahneman and Tversky's
(1973) engineer-lawyer problem disappeared if subjects could
randomly draw the descriptions from an urn. Similar results
showing people's sensitivity to the issue of representative versus
selected sampling have been reported by Cosmides and Tooby
(1990), Ginossar and Trope (1987), Grether (1980), Hansen and
Donoghue (1977), and Wells and Harvey (1977), but see Nisbett
and Borgida (1975).
This study has also demonstrated that judgments of single
events can systematically differ from judgments of relative fre-
quencies. Similar differences were found for other kinds of prob-
abilistic reasoning (Gigerenzer, 1991a, 1991b). For instance, the
"conjunction fallacy" has been established by asking subjects
the probabilities of single events, such as whether "Linda" is
more likely to be (a) a bank teller or (b) a bank teller and active
in the feminist movement. Most subjects chose the latter, be-
cause the description of Linda was constructed to be representa-
tive of an active feminist. This judgment was called a conjunc-
tion fallacy because the probability of a conjunction (bank
teller and feminist) is never larger than the probability of one of
its constituents. As in the engineer-lawyer problem, the repre-
sentativeness heuristic was proposed to explain the "fallacy."
Fiedler (1988) and Tversky and Kahneman (1983), however,
showed that the conjunction fallacy largely disappeared if peo-
ple were asked for frequencies (e.g., "There are 100 persons like
Linda. How many of them are. . . ?") rather than probabilities
of single events. Cosmides and Tooby (1990) showed a similar
striking difference for people's reasoning in a medical probabil-
ity revision problem. The subjects' task was to estimate the
probability that people have a disease, given a positive test re-
sult, the base rate of the disease, the false-alarm rate, and the hit
rate of the test. Originally, Casscells, Schoenberger, and Gray-
boys (1978) reported only 18% Bayesian answers when Harvard
medical students and staff were asked for a single-event proba-
bility (What is the probability that a person found to have a
positive result actually has the disease?). When Cosmides and
Tooby changed the task into a frequency task (How many peo-
ple who test positive will actually have the disease?), 76% of
subjects responded with the Bayesian answer. These results sug-
gest that the mental models subjects construe to solve these
reasoning problems were highly responsive to information cru-
cial for probability and statistics—random versus selected sam-
pling and single events versus frequencies in the long run.
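The medical problem just described is a straightforward application of Bayes' rule. With the figures commonly used for this problem (a base rate of 1/1000, a false-alarm rate of 5%, and, as is usually assumed, a hit rate of 100%), the single-event and the frequency formulations yield the same number (a sketch; the function is ours):

```python
def posterior(base_rate, hit_rate, false_alarm_rate):
    """P(disease | positive test) by Bayes' rule."""
    true_pos = hit_rate * base_rate                  # diseased and positive
    false_pos = false_alarm_rate * (1 - base_rate)   # healthy but positive
    return true_pos / (true_pos + false_pos)

# Single-event version: the probability that a person who tests positive
# actually has the disease.
p = posterior(base_rate=0.001, hit_rate=1.0, false_alarm_rate=0.05)
print(round(p, 3))  # 0.02

# Frequency version: of 1,000 people, 1 has the disease (and tests
# positive), and about 50 healthy people also test positive, so roughly
# 1 out of 51 positives is actually ill -- the same 2%.
```

The point of the Cosmides and Tooby result is not that the two formats have different correct answers (they do not) but that subjects find the Bayesian answer far more often in the frequency format.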
Is Overconfidence a Bias According to Probability Theory?
Throughout this article, we have avoided classifying judg-
ments as either rational or biased, but instead focused on the
underlying cognitive processes and how these explain extant
data. Overconfidence is, however, usually classified as a bias
and dealt with in chapters on "cognitive illusions" (e.g., Edwards
& von Winterfeldt, 1986). Is overconfidence a bias according to
probability theory?
Mathematical probability emerged around 1660 as a Janus-
faced concept with three interpretations: observed frequencies
of events, equal possibilities based on physical symmetry, and
degrees of subjective certainty or belief. Frequencies originally
came from mortality and natality data, sets of equiprobable
outcomes from gambling, and the epistemic sense of belief pro-
portioned to evidence from courtroom practices (Daston,
1988). Eighteenth-century mathematicians used "probability"
in all three senses, whereas latter-day probabilists drew a bold
line between the first two "objective" senses and the third "subjec-
tive" one. Today, mathematicians, statisticians, and philoso-
phers are still wrangling over the proper interpretation of proba-
bility: Does it mean a relative frequency, a propensity, a degree
of belief, a degree of evidentiary confirmation, or yet some-
thing else? Prominent thinkers can still be found in every camp,
and it would be bold unto foolhardy to claim that any interpre-
tation had a monopoly on reasonableness (Gigerenzer et al.,
1989).
Overconfidence is defined as the difference between degrees
of belief (subjective probabilities) and a relative frequency (per-
centage correct). Is a deviation between the probability that a
particular answer is correct and the relative frequency of
correct answers a bias or error, according to probability theory?
From the point of view of dedicated frequentists such as von
Mises (1928/1957) and Neyman (1977), it is not. According to
the frequentist interpretation (which is the dominant interpre-
tation in statistics departments today), probability theory is
about relative frequencies in the long run; it does not deal with
degrees of beliefs concerning single events. For instance, when
speaking of "the probability of death":
[One] must not think of an individual, but of a certain class as a whole, e.g., "all insured men forty-one years old living in a given country and not engaged in certain dangerous occupations." . . . The phrase "probability of death," when it refers to a single person, has no meaning at all for us. (von Mises, 1928/1957, p. 11)
For a frequentist, one cannot properly speak of a probability
until a reference class has been defined. The statistician Bar-
nard (1979), for instance, suggested that if one is concerned
with the subjective probabilities of single events, such as confi-
dence, one "should concentrate on the works of Freud and per-
haps Jung rather than Fisher and Neyman" (p. 171). Thus, for
frequentists, probability theory does not apply to single-event
judgments like confidences, and therefore no statement about
confidences can violate probability theory.
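The measure of overconfidence at issue here (mean confidence minus relative frequency of correct answers) can be made concrete in a few lines (our illustration; the numbers are invented):

```python
def overconfidence(confidences, outcomes):
    """Mean confidence minus the relative frequency of correct answers."""
    mean_conf = sum(confidences) / len(confidences)
    pct_correct = sum(outcomes) / len(outcomes)
    return mean_conf - pct_correct

# Five answers judged with high confidence, of which only three were correct:
conf = [0.9, 0.8, 1.0, 0.7, 0.9]   # subjective probabilities of being correct
hits = [1, 1, 0, 1, 0]             # 1 = answer was correct
print(round(overconfidence(conf, hits), 2))  # 0.26, i.e., .86 - .60
```

A positive score is what the literature labels overconfidence; the frequentist objection above is that the minuend and subtrahend belong to different interpretations of probability.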
Moreover, even subjectivists would not generally think of a
deviation between probabilities for single events and relative
frequencies as a bias. The problem is whether and when a subjec-
tivist, who rejects the identification of probability with objec-
tive frequency, should nonetheless make frequency the yard-
stick of good reasoning (for a discussion of conditions, see Ka-
dane & Lichtenstein, 1982). The subjectivist Bruno de Finetti,
for instance, emphatically stated in his early work that subjec-
tive probabilities of single events cannot be validated by objec-
tive probabilities:
However an individual evaluates the probability of a particular event, no experience can prove him right, or wrong; nor, in general, could any conceivable criterion give any objective sense to the distinction one would like to draw, here, between right and wrong. (de Finetti, 1931/1989, p. 174)
We thus have to face a problem: Many cognitive psychologists
think of overconfidence as a bias of reasoning, pointing to prob-
ability theory as justification. Many probabilists and statisti-
cians, however, would reply that their interpretation of probabil-
ity does not justify this label (see Hacking, 1965; Lad, 1984;
Stegmüller, 1973).
PMM theory can offer a partial solution to this problem.
First, it clarifies the distinction between confidence and fre-
quency judgment and therewith directs attention to the compar-
ison between estimated and true frequencies of correct an-
swers. The latter avoids the previously stated problem. This
comparison has not received much attention in research on
confidence in knowledge. Second, PMM theory proposes a fre-
quentist interpretation of degrees of belief: Both confidence
and frequency judgments are based on memory about frequen-
cies. Our view links both types of judgment but does not equate
them. Rather, it specifies when to expect confidence and fre-
quency judgments to diverge, and in what direction, and when
they will converge. PMM theory integrates single-event proba-
bilities into a frequentist framework: the Bayesian is Bruns-
wikian.
Conclusions
We conjecture that confidence in one's knowledge of the kind
studied here—immediate and spontaneous rather than a prod-
uct of long-term reflection—is largely determined by the struc-
ture of the task and the structure of a corresponding, known
environment in a person's long-term memory. We provided ex-
perimental evidence for this hypothesis by showing how
changes in the task (confidence vs. frequency judgment) and in
the relationship between task and environment (selected vs. rep-
resentative sampling) can make the two stable effects reported
in the literature—overconfidence and the hard-easy effect—
emerge, disappear, and invert at will. We have demonstrated a
new phenomenon, the confidence-frequency effect. One can-
not speak of a general overconfidence bias anymore, in the
sense that it relates to deficient processes of cognition or moti-
vation. In contrast, subjects seem to be able to make fine con-
ceptual distinctions—confidence versus frequency—of the
same kind as probabilists and statisticians do. Earlier attempts
postulating general deficiencies in information processing or
motivation cannot account for the experimental results pre-
dicted by PMM theory and confirmed in two experiments.
PMM theory seems to be the first theory in this field that gives
a coherent account of these various effects by focusing on the
relation between the structure of the task, the structure of a
corresponding environment, and a PMM.
References
Allport, D. A. (1975). The state of cognitive psychology. Quarterly Jour-
nal of Experimental Psychology, 27, 141-152.
Allwood, C. M., & Montgomery, H. (1987). Response selection strate-
gies and realism of confidence judgments. Organizational Behavior
and Human Decision Processes, 39, 365-383.
Anderson, N. H. (1986). A cognitive theory of judgment and decision. In B. Brehmer, H. Jungermann, P. Lourens, & G. Sevón (Eds.), New
directions in research on decision making (pp. 63-108). Amsterdam:
North-Holland.
Angele, U., Beer-Binder, B., Berger, R., Bussmann, C., Kleinbölting, H.,
& Mansard, B. (1982). Über- und Unterschätzung des eigenen Wissens
in Abhängigkeit von Geschlecht und Bildungsstand [Overestimation
and underestimation of one's knowledge as a function of sex and
education]. Unpublished manuscript, University of Konstanz, Con-
stance, Federal Republic of Germany.
Arkes, H. R., Christensen, C., Lai, C., & Blumer, C. (1987). Two meth-
ods of reducing overconfidence. Organizational Behavior and Hu-
man Decision Processes, 39,133-144.
Arkes, H. R., & Hammond, K. R. (Eds.). (1986). Judgment and decision
making: An interdisciplinary reader. Cambridge, England: Cam-
bridge University Press.
Armelius, B., & Armelius, K. (1974). The use of redundancy in multiple-cue judgments: Data from a suppressor-variable task. American
Journal of Psychology, 87, 385-392.
Armelius, K. (1979). Task predictability and performance as determi-
nants of confidence in multiple-cue judgments. Scandinavian Jour-nal of Psychology, 20,19-25.
Barnard, G. A. (1979). Discussion of the paper by Professors Lindley
andTverskyand Dr. Brown. Journal of the Royal Statistical Society of
London, 142 (Series A), 171-172.
Björkman, M. (1987). A note on cue probability learning: What condi-
tioning data reveal about cue contrast. Scandinavian Journal of Psy-
chology, 28, 226-232.
Brehmer, B. (1988). The development of social judgment theory. In B.
Brehmer & C. R. B. Joyce (Eds.), Human judgment: The SJT view
(pp. 13-40). Amsterdam: North-Holland.
Brehmer, B., & Joyce, C. R. B. (Eds.). (1988). Human judgment: TheSJT view. Amsterdam: North-Holland.
Brunswik, E. (1943). Organismic achievement and environmental
probability. Psychological Review, 50, 255-272.
Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193-217.
Brunswik, E. (1964). Scope and aspects of the cognitive problem. In
Contemporary approaches to cognition (pp. 4-31). Cambridge, MA:
Harvard University Press.
Casscells, W., Schoenberger, A., & Grayboys, T. (1978). Interpretation
by physicians of clinical laboratory results. New England Journal of
Medicine, 299, 999-1000.
Cohen, L. J. (1989). The philosophy of induction and probability. Oxford,
England: Clarendon Press.
Cosmides, L., & Tooby, J. (1990, August). Is the mind a frequentist? Paper presented at the 31st Annual Meeting of the Psychonomics
Society, New Orleans, LA.
Daston, L. J. (1988). Classical probability in the Enlightenment. Prince-
ton, NJ: Princeton University Press.
Dawes, R. M. (1980). Confidence in intellectual judgments vs. confi-
dence in perceptual judgments. In E. D. Lantermann & H. Feger
(Eds.), Similarity and choice: Papers in honor of Clyde Coombs (pp.
327-345). Bern, Switzerland: Huber.
de Finetti, B. (1989). Probabilism. Erkenntnis, 31, 169-223. (Original
work published 1931)
Duncker, K. (1945). On problem solving (L. S. Lees, Trans.). Psychologi-
cal Monographs, 58(5, Whole No. 270). (Original work published1935)
Edwards, W, & von Winterfeldt, D. (1986). On cognitive illusions andtheir implications. In H. R. Arkes & K. R. Hammond (Eds.), Judg-
ment and decision making: An interdisciplinary reader (pp. 642-679).
Cambridge, England: Cambridge University Press.
Einhorn, H. I, & Hogarth, R. M. (1981). Behavioral decision theory:
Processes of judgment and choice. Annual Review of Psychology, 32,
53-88.
Fiedler, K. (1988). The dependence of the conjunction fallacy on subtle
linguistic factors. Psychological Research. 50,123-129.
Fischhoff, B. (1982). Debiasing. In D. Kahneman, P. Slovic, & A.
Tversky (Eds.), Judgment under uncertainty: Heuristics and biases
(pp. 422-444). Cambridge, England: Cambridge University Press.
Fischhoff, B., & MacGregor, D. (1982). Subjective confidence in fore-
casts. Journal of Forecasting, 1, 155-172.
Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with cer-
tainty: The appropriateness of extreme confidence. Journal of Exper-
imental Psychology: Human Perception and Performance, 3, 552-
564.
Fraisse, P. (1964). The psychology of time. London: Eyre & Spottis-
woode.
Gigerenzer, G. (1984). External validity of laboratory experiments:
The frequency-validity relationship. American Journal of Psychol-
ogy, 97,185-195.
Gigerenzer, G. (1987). Survival of the fittest probabilist: Brunswik,
Thurstone, and the two disciplines of psychology. In L. Kruger, G.
Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution, Vol.
2: Ideas in the sciences. Cambridge, MA: MIT Press.
Gigerenzer, G. (1991a). From tools to theories: A heuristic of discovery
in cognitive psychology. Psychological Review, 98, 254-267.
Gigerenzer, G. (1991b). How to make cognitive illusions disappear:
Beyond "heuristics and biases". European Review of Social Psychol-
ogy, 2, 83-115.
Gigerenzer, G., Hell, W, & Blank, H. (1988). Presentation and content:
The use of base rates as a continuous variable. Journal of Experimen-tal Psychology: Human Perception and Performance, 14, 513-525.
Gigerenzer, G, & Murray, D. J. (1987). Cognition as intuitive statistics.Hillsdale, NJ: Erlbaum.
Gigerenzer, G., & Richter, H. R. (1990). Context effects and their inter-
action with development: Area judgments. Cognitive Development,
5, 235-264.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L. J., Beatty, J., &
Krüger, L. (1989). The empire of chance: How probability changed
science and everyday life. Cambridge, England: Cambridge Univer-
sity Press.
Ginossar, Z., & Trope, Y. (1987). Problem solving in judgment under
uncertainty. Journal of Personality and Social Psychology, 52, 464-
474.
Grether, D. M. (1980). Bayes rule as a descriptive model: The represen-tativeness heuristic. The Quarterly Journal of Economics, 95, 537-
557.
Hacking, I. (1965). Logic of statistical inference. Cambridge, England:
Cambridge University Press.
Hammond, K. R., Stewart, T. R., Brehmer, B., & Steinmann, D. O.
(1975). Social judgment theory. In M. F. Kaplan & S. Schwartz (Eds.),
Human judgment and decision processes. San Diego, CA: Academic
Press.
Hammond, K. R., & Wascoe, N. E. (Eds.) (1980). Realizations of
Brunswik's representative design. New Directions for the Methodol-ogy of Social and Behavioral Science, 3, 271-312.
Hansen, R. D., & Donoghue, J. M. (1977). The power of consensus:
Information derived from one's own and others' behavior. Journal of
Personality and Social Psychology, 35, 294-302.
Hasher, L., Goldstein, D., & Toppino, T. (1977). Frequency and the conference of referential validity. Journal of Verbal Learning and Ver-
bal Behavior, 16, 107-112.
Hasher, L., & Zacks, R. T. (1979). Automatic and effortful processes in
memory. Journal of Experimental Psychology: General, 108, 356-
388.
Hintzman, D. L., Nozawa, G., & Irmscher, M. (1982). Frequency as a
nonpropositional attribute of memory. Journal of Verbal Learning
and Verbal Behavior, 21, 127-141.
Howell, W. C., & Burnett, S. (1978). Uncertainty measurement: A cognitive taxonomy. Organizational Behavior and Human Performance,
22, 45-68.
Johnson-Laird, P. N. (1983). Mental models. Cambridge, MA: Harvard
University Press.
Juslin, P. (1991). Well-calibrated general knowledge: An ecological induc-
tive approach to realism of confidence. Manuscript submitted for pub-
lication. Uppsala, Sweden.
Kadane, J. B., & Lichtenstein, S. (1982). A subjectivist view of calibration
(Rep. No. 82-86). Eugene, OR: Decision Research.
Kahneman, D., & Tversky, A. (1973). On the psychology of prediction.
Psychological Review, 80, 237-251.
Kahneman, D., & Tversky, A. (1982). On the study of statistical intu-
itions. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment
under uncertainty: Heuristics and biases (pp. 493-508). Cambridge,
England: Cambridge University Press.
Keren, G. (1987). Facing uncertainty in the game of bridge: A calibra-
tion study. Organizational Behavior and Human Decision Processes,
39, 98-114.
Keren, G. (1988). On the ability of monitoring non-veridical percep-
tions and uncertain knowledge: Some calibration studies. Acta Psy-
chologica, 67, 95-119.
Koriat, A., Lichtenstein, S., & Fischhoff, B. (1980). Reasons for confi-
dence. Journal of Experimental Psychology: Human Learning and
Memory, 6, 107-118.
Kyburg, H. E. (1983). Rational belief. Behavioral and Brain Sciences, 6,
231-273.
Lad, F. (1984). The calibration question. British Journal for the Philo-
sophy of Science, 35, 213-221.
Leary, D. E. (1987). From act psychology to probabilistic functiona-
lism: The place of Egon Brunswik in the history of psychology. In
M. G. Ash & W. R. Woodward (Eds.), Psychology in twentieth-century
thought and society (pp. 115-142). Cambridge, England: Cambridge
University Press.
Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also
know more about how much they know? The calibration of probabil-
ity judgments. Organizational Behavior and Human Performance,
20,159-183.
Lichtenstein, S., & Fischhoff, B. (1981). The effects of gender and in-
struction on calibration (Tech. Rep. No. PTR-1092-81-7). Eugene,
OR: Decision Research.
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of
probabilities: The state of the art to 1980. In D. Kahneman, P. Slovic, &
A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases
(pp. 306-334). Cambridge, England: Cambridge University Press.
MacGregor, D., & Slovic, P. (1986). Perceived acceptability of risk analy-
sis as a decision-making approach. Risk Analysis, 6, 245-256.
May, R. S. (1986). Overconfidence as a result of incomplete and wrong
knowledge. In R. W. Scholz (Ed.), Current issues in West German
decision research (pp. 13-30). Frankfurt am Main, Germany: Lang.
May, R. S. (1987). Realismus von subjektiven Wahrscheinlichkeiten: Eine
kognitionspsychologische Analyse inferentieller Prozesse beim Over-
confidence-Phänomen [Calibration of subjective probabilities: A cog-
nitive analysis of inference processes in overconfidence]. Frankfurt,
Federal Republic of Germany: Lang.
Mayseless, O., & Kruglanski, A. W. (1987). What makes you so sure?
Effects of epistemic motivations on judgmental confidence. Organi-zational Behavior and Human Decision Processes, 39,162-183.
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood
Cliffs, NJ: Prentice-Hall.
Neyman, J. (1977). Frequentist probability and frequentist statistics.
Synthese, 36, 97-131.
Nisbett, R. E., & Borgida, E. (1975). Attribution and the psychology of
prediction. Journal of Personality and Social Psychology, 32, 932-
943.
Parducci, A. (1965). Category judgment: A range-frequency model.Psychological Review, 72, 407-418.
Ronis, D. L., & Yates, J. F. (1987). Components of probability judgmentaccuracy: Individual consistency and effects of subject matter and
assessment method. Organizational Behavior and Human Decision
Processes, 40,193-218.
Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B.
Lloyd (Eds.), Cognition and categorization (pp. 27-48). Hillsdale, NJ: Erlbaum.
Schönemann, P. H. (1983). Some theory and results for metrics for bounded response scales. Journal of Mathematical Psychology, 27,
311-324.
Shafer, G. (1978). Non-additive probabilities in the work of Bernoulli and Lambert. Archive for the History of Exact Sciences, 19, 309-370.
Sivyer, M., & Finlay, E. H. (1982). Perceived duration of auditory sequences. Journal of General Psychology, 107, 209-217.
Statistisches Bundesamt. (1986). Statistisches Jahrbuch 1986 für die Bundesrepublik Deutschland [Statistical yearbook 1986 for the Federal Republic of Germany]. Stuttgart, Germany: Kohlhammer.
Stegmüller, W. (1973). Probleme und Resultate der Wissenschaftstheorie und Analytischen Philosophie. Bd. IV: Personelle und Statistische Wahrscheinlichkeit. Teil E. [Problems and results of philosophy of science and analytical philosophy. Vol. IV: Personal and statistical probability. Part E]. Berlin, Federal Republic of Germany: Springer.
Tanner, W. P., Jr., & Swets, J. A. (1954). A decision-making theory of visual detection. Psychological Review, 61, 401-409.
Teigen, K. H. (1974). Overestimation of subjective probabilities. Scandinavian Journal of Psychology, 15, 56-62.
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293-315.
von Mises, R. (1928). Wahrscheinlichkeit, Statistik und Wahrheit. Berlin, Germany: Springer. (Translated and reprinted as Probability, statistics, and truth. New York: Dover, 1957)
von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. Cambridge, England: Cambridge University Press.
Wagenaar, W., & Keren, G. B. (1985). Calibration of probability assessments by professional blackjack dealers, statistical experts, and lay people. Organizational Behavior and Human Decision Processes, 36, 406-416.
Wells, G. L., & Harvey, J. H. (1977). Do people use consensus information in making causal attributions? Journal of Personality and Social Psychology, 35, 279-293.
Zacks, R. T., Hasher, L., & Sanft, H. (1982). Automatic encoding of event frequency: Further findings. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8, 106-116.
Zimmer, A. C. (1983). Verbal vs. numerical processing of subjective probabilities. In R. W. Scholz (Ed.), Decision making under uncertainty (pp. 159-182). Amsterdam: North-Holland.
Zimmer, A. C. (1986). What uncertainty judgments can tell about the underlying subjective probabilities. Uncertainty in Artificial Intelligence, 249-258.
Received March 12, 1990
Revision received December 18, 1990
Accepted December 30, 1990