
Overview

Bayesian learning theory applied to human cognition

Robert A. Jacobs1∗ and John K. Kruschke2

Probabilistic models based on Bayes' rule are an increasingly popular approach to understanding human cognition. Bayesian models allow immense representational latitude and complexity. Because they use normative Bayesian mathematics to process those representations, they define optimal performance on a given task. This article focuses on key mechanisms of Bayesian information processing, and provides numerous examples illustrating Bayesian approaches to the study of human cognition. We start by providing an overview of Bayesian modeling and Bayesian networks. We then describe three types of information processing operations—inference, parameter learning, and structure learning—in both Bayesian networks and human cognition. This is followed by a discussion of the important roles of prior knowledge and of active learning. We conclude by outlining some challenges for Bayesian models of human cognition that will need to be addressed by future research. © 2010 John Wiley & Sons, Ltd. WIREs Cogn Sci 2011 2:8–21 DOI: 10.1002/wcs.80

∗Correspondence to: [email protected]

1 Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA
2 Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN, USA

INTRODUCTION

Computational modeling of human cognition has focused on a series of different formalisms over recent decades. In the 1970s, production systems were considered a methodology that would unite the studies of human and machine intelligence. In the 1980s and 1990s, connectionist networks were thought to be a key to understanding how cognitive information processing is performed by biological nervous systems. These formalisms continue to yield valuable insights into the processes of cognition. Since the 1990s, however, probabilistic models based on Bayes' rule have become increasingly popular, perhaps even dominant in the field of cognitive science. Importantly, Bayesian modeling provides a unifying framework that has made important contributions to our understanding of nearly all areas of cognition, including perception, language, motor control, reasoning, learning, memory, and development.

In this article, we describe some advantages offered by Bayesian modeling relative to previous formalisms. In particular, Bayesian models allow immense representational latitude and complexity, and use normative Bayesian mathematics to process those representations. This representational complexity contrasts with the relative simplicity of nodes in a connectionist network, or if-then rules in a production system. In this article, we focus on key mechanisms of Bayesian information processing, and we provide examples illustrating Bayesian approaches to human cognition. The article is organized as follows. We start by providing an overview of Bayesian modeling and Bayesian networks. Next, we describe three types of information processing operations found in both Bayesian networks and human cognition. We then discuss the important role of prior knowledge in Bayesian models. Finally, we describe how Bayesian models naturally address active learning, which is a behavior that other formalisms may not address so transparently.

BAYESIAN MODELING

Are people rational? This is a complex question whose answer depends on many factors, including the task under consideration and the definition of the word 'rational'. A common observation of cognitive scientists is that we live in an uncertain world, and rational behavior depends on the ability to process information effectively despite ambiguity or uncertainty. Cognitive scientists, therefore, need methods for characterizing information and the uncertainty in that information. Fortunately, such methods are available—probability theory provides a calculus for representing and manipulating uncertain information. An advantage of Bayesian models relative to many other types of models is that they are probabilistic.

Probability theory does not provide just any calculus for representing and manipulating uncertain information; it provides an optimal calculus.1 Consequently, an advantage of Bayesian modeling is that it gives cognitive scientists a tool for defining rationality. Using Bayes' rule, Bayesian models optimally combine information based on prior beliefs with information based on observations or data. Using Bayesian decision theory, Bayesian models can use these combinations to choose actions that maximize task performance. Owing to these optimal properties, Bayesian models perform a task as well as the task can be performed, meaning that the performance of a Bayesian model on a task defines rational behavior for that task.

Of course, the performance of a model depends on how it represents prior beliefs, observations, and task goals. That is, the representational assumptions of a model influence its performance. It is important to keep in mind that any task can be modeled in multiple ways, each using a different set of assumptions. One model may assume the use of certain dimensional perceptual representations which are combined linearly into a probabilistic choice, whereas another model may assume featural representations combined conjunctively into a probabilistic choice. But for any specific probabilistic formalization of a task, a Bayesian model specifies optimal performance given the set of assumptions made by the model.

This fact leads to an additional advantage of Bayesian modeling relative to many other approaches, namely that the assumptions underlying Bayesian models are often explicitly stated as well-defined mathematical expressions and, thus, are easy to examine, evaluate, and modify. Indeed, a key reason for using the Bayesian modeling formalism is that, relative to many other computational formalisms, it allows cognitive scientists to more easily study the advantages and disadvantages of different assumptions. (Admittedly, this property is also possessed by other formalisms, especially in the mathematical psychology tradition, that have rigorous mathematical or probabilistic interpretations; see, e.g., Busemeyer and Diederich,2 and many examples in Scarborough and Sternberg.3 But such mathematically explicit models form only a subset of the spectrum of computational models in cognitive science; cf. Refs 4 and 5.) Through this study, scientists can ask questions about the nature or structure of a task (Marr6 referred to the analysis of the structure of a task as a 'computational theory'), such as: What are the variables that need to be taken into account, the problems that need to be solved, and the goals that need to be achieved in order to perform a task? Which of these problems or goals are easy or hard? And which assumptions are useful, necessary, and/or sufficient for performance of the task?

Let us assume that the performance of a person on a task is measured, a Bayesian model is applied to the same task, and the performances of the person and the model are equal. This provides important information to a cognitive scientist because it provides the scientist with an explanation for the person's behavior—the person is behaving optimally because he or she is using and combining all relevant information about the task in an optimal manner. In addition, this result supports (but does not prove) the hypothesis that the assumptions used by the model about prior beliefs, observations, and task goals may also be used by the person.

Alternatively, suppose that the performance of the model exceeds that of the person. This result also provides useful information. It indicates that the person is not using all relevant information or not combining this information in an optimal way. That is, it suggests that there are cognitive 'bottlenecks' preventing the person from performing better. Further experimentation can attempt to identify these bottlenecks, and training can try to ameliorate or remove these bottlenecks. Possible bottlenecks might include cognitive capacity limitations such as limits on the size of working memory or on the quantity of attentional resources. Bottlenecks might also include computational limitations. Bayesian models often perform complex computations in high-dimensional spaces. It may be that people are incapable of these complex computations and, thus, incapable of performing Bayesian calculations. Later in this article, we will address the possibility that people are not truly Bayesian, but only approximately Bayesian.

If a cognitive scientist hypothesizes that a person's performance on a task is suboptimal because of a particular bottleneck, then the scientist can develop a new model that also contains the posited bottleneck. For instance, if a scientist believes that a person performs suboptimally because of an inability to consider stimuli that occurred far in the past, the scientist can study a model that only uses recent inputs. Identical performances by the new model and the person lend support to the idea that the person, like the model, is only using recent inputs.


Finally, suppose that the performance of the person exceeds that of the model—that is, the person's performance exceeds the optimal performance defined by the model—then once again this is a useful result. It suggests that the person is using information sources or assumptions that are not currently part of the model. For instance, if a model only considers the previous word when predicting the next word in a sentence, then a person who outperforms the model is likely using more information (e.g., several previous words) when he or she makes a prediction. A cognitive scientist may consider a new model with additional inputs. As before, identical performances by the new model and the person lend support to the idea that the person, like the model, is using these additional inputs.

In some sense, the Bayesian approach to the study of human cognition might seem odd. It is based on an analysis of what level of task performance is achievable given a set of assumptions. However, it does not make a strong statement about how a person achieves that task performance—which mental representations and algorithms the person uses while performing a task. Bayesian models are not intended to provide mechanistic or process accounts of cognition.6,7 For cognitive scientists interested in mechanistic accounts, a Bayesian model often suggests possible hypotheses about representations and algorithms that a person might use while performing a task, but these hypotheses need to be evaluated by other means. Similarly, if a person performs suboptimally relative to a Bayesian model, the model may suggest hypotheses as to which underlying mechanisms are at fault, but these hypotheses are only starting points for further investigation. Consequently, Bayesian modeling complements, but does not supplant, the use of experimental and other theoretical methodologies.

For all the reasons outlined here, Bayesian modeling has become increasingly important in the field of cognitive science.6–16 Rather than describing particular Bayesian models in depth, a main focus of this article is on information processing operations—inference, parameter learning, and structure learning—found in Bayesian models and human cognition. Before turning to these operations, however, we first describe Bayesian networks, a formalism that makes these operations particularly easy to understand.

BAYESIAN NETWORKS

When a theorist develops a mental model of a cognitive domain, it is necessary to first identify the variables that must be taken into account. The domain can then be characterized through the joint probability distribution of these variables. Many different information processing operations can be carried out by manipulating this distribution.

Although conceptually appealing, joint distributions are often impractical to work with directly because real-world domains contain many potentially relevant variables, meaning that joint distributions can be high-dimensional. If a domain contains 100 variables, then the joint distribution is 100-dimensional. Hopefully, it is the case that some variables are independent (or conditionally independent given the values of other variables) and, thus, the joint distribution can be factored into a product of a small number of conditional distributions where each conditional distribution is relatively low-dimensional. If so, then it is more computationally efficient to work with the small set of low-dimensional conditional distributions than with the high-dimensional joint distribution.

Bayesian networks have become popular in artificial intelligence and cognitive science because they graphically express the factorization of a joint distribution.17–19 A network contains nodes, edges, and probability distributions. Each node corresponds to a variable. Each edge corresponds to a relationship between variables. Edges go from 'parent' variables to 'child' variables, thereby indicating that the values of the parent variables directly influence the values of the child variables. Each conditional probability distribution provides the probability of a child variable taking a particular value given the values of its parent variables. The joint distribution of all variables is equal to the product of the conditional distributions. For example, suppose that the joint distribution of variables A, B, C, D, E, F, and G can be factored as follows:

p(A, B, C, D, E, F, G) = p(A) p(B) p(C|A) p(D|A, B) p(E|B) p(F|C) p(G|D, E)   (1)

Then the Bayesian network in Figure 1 represents this joint distribution.
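To make the factorization concrete, here is a minimal sketch in Python (our illustration, not from the article) in which each factor of Eq. (1) is stored as a conditional probability table over binary variables, and the joint probability of a complete assignment is computed as the product of the factors. All numeric values are arbitrary.

```python
# Sketch: the factorization in Eq. (1) as a product of conditional
# probability tables (CPTs). Values are arbitrary, for illustration only.

def p_joint(a, b, c, d, e, f, g, cpt):
    """Joint probability of one setting of binary variables A..G (Eq. 1)."""
    return (cpt["A"][a] * cpt["B"][b] * cpt["C|A"][a][c] *
            cpt["D|AB"][(a, b)][d] * cpt["E|B"][b][e] *
            cpt["F|C"][c][f] * cpt["G|DE"][(d, e)][g])

# Each table maps parent values to [p(child=0), p(child=1)].
cpt = {
    "A": [0.7, 0.3],
    "B": [0.4, 0.6],
    "C|A": [[0.9, 0.1], [0.2, 0.8]],
    "D|AB": {(0, 0): [0.8, 0.2], (0, 1): [0.5, 0.5],
             (1, 0): [0.4, 0.6], (1, 1): [0.1, 0.9]},
    "E|B": [[0.6, 0.4], [0.3, 0.7]],
    "F|C": [[0.95, 0.05], [0.25, 0.75]],
    "G|DE": {(0, 0): [0.99, 0.01], (0, 1): [0.3, 0.7],
             (1, 0): [0.2, 0.8], (1, 1): [0.05, 0.95]},
}

print(p_joint(1, 0, 1, 1, 0, 1, 1, cpt))  # probability of one joint setting
```

Note that only seven small tables are needed, rather than one table with 2^7 entries; this is the computational saving that the factorization provides.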

FIGURE 1 | Bayesian network representing the joint probability distribution of variables A, B, C, D, E, F, and G. A node represents the value of the variable it contains. Arrows impinging on a node indicate what other variables the value of the node depends on. The conditional probability beside each node expresses mathematically what the arrows express graphically.

The parameters of a Bayesian network are the parameters underlying the conditional probability distributions. For example, suppose that the variables of the network in Figure 1 are real-valued, and suppose that each variable is distributed according to a normal distribution whose mean is equal to the weighted sum of its parents' values (plus a bias weight) and whose variance is a fixed constant. [In other words, the conditional distribution of X given the values of its parents is a normal distribution whose mean is Σ_{i∈pa(X)} w_Xi V_i + w_Xb and whose variance is σ²_X, where X is a variable, i indexes the variables that are parents of X, V_i is the value of variable i, w_Xi is a weight relating the value of variable i to X, and w_Xb is a bias weight for X. In Figure 1, for instance, A ∼ N(w_Ab, σ²_A), C ∼ N(w_CA A + w_Cb, σ²_C), and D ∼ N(w_DA A + w_DB B + w_Db, σ²_D).] Then the weights would be the network's parameters.
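The linear-Gaussian parameterization just described is easy to simulate. The sketch below (our illustration; all weights and variances are invented) draws one joint sample from the Figure 1 network by ancestral sampling: each variable is drawn from a normal distribution whose mean is a weighted sum of its parents' values plus a bias.

```python
# Sketch: ancestral sampling from the linear-Gaussian network of Figure 1.
# Weights and variances are arbitrary illustrative choices.
import random

def sample_network(w, sigma):
    """Draw one joint sample (A..G), sampling each node given its parents."""
    g = random.gauss
    A = g(w["Ab"], sigma["A"])
    B = g(w["Bb"], sigma["B"])
    C = g(w["CA"] * A + w["Cb"], sigma["C"])
    D = g(w["DA"] * A + w["DB"] * B + w["Db"], sigma["D"])
    E = g(w["EB"] * B + w["Eb"], sigma["E"])
    F = g(w["FC"] * C + w["Fb"], sigma["F"])
    G = g(w["GD"] * D + w["GE"] * E + w["Gb"], sigma["G"])
    return A, B, C, D, E, F, G

w = {"Ab": 0.0, "Bb": 1.0, "CA": 0.8, "Cb": 0.1, "DA": 0.5, "DB": -0.3,
     "Db": 0.0, "EB": 1.2, "Eb": 0.2, "FC": 0.9, "Fb": 0.0, "GD": 0.4,
     "GE": 0.6, "Gb": -0.1}
sigma = {v: 1.0 for v in "ABCDEFG"}
print(sample_network(w, sigma))
```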

INFERENCE

If the values of some variables are observed, then these data can be used to update our beliefs about other variables. This process is referred to as 'inference', and can be carried out using Bayes' rule. For example, suppose that the values of F and G are observed, and we would like to update our beliefs about the unobserved values A, B, C, D, and E. Using Bayes' rule:

p(A, B, C, D, E | F, G) = p(F, G | A, B, C, D, E) p(A, B, C, D, E) / p(F, G)   (2)

where p(A, B, C, D, E) is the prior probability of A, B, C, D, and E; p(A, B, C, D, E | F, G) is the posterior probability of these variables given data F and G; p(F, G | A, B, C, D, E) is the probability of the observed data (called the likelihood function of A, B, C, D, and E); and p(F, G) is a normalization term referred to as the evidence.

Inference often requires marginalization. Suppose we are only interested in updating our beliefs about A and B (i.e., we do not care about the values of C, D, and E); this can be achieved by 'integrating out' the irrelevant variables:

p(A, B | F, G) = ∫∫∫ p(A, B, C, D, E | F, G) dC dD dE   (3)

FIGURE 2 | Bayesian network characterizing a domain with four binary variables indicating whether it is cloudy, the sprinkler was recently on, it recently rained, and the grass is wet.

In some cases, inference can be carried out in a computationally efficient manner using a local message-passing algorithm.18 In other cases, inference is computationally expensive, and approximation techniques, such as Markov chain Monte Carlo sampling, may be needed.20

Is Bayesian inference relevant to human cognition? We think that the answer is yes. To motivate this answer, we first consider a pattern of causal reasoning referred to as 'explaining away'. Explaining away is an instance of the 'logic of exoneration' (e.g., if one suspect confesses to a crime, then unaffiliated suspects are exonerated). In general, increasing the believability of some hypotheses necessarily decreases the believability of others.

The Bayesian network illustrated in Figure 2 characterizes a domain with four binary variables indicating whether it is cloudy, whether the sprinkler was recently on, whether it recently rained, and whether the grass is wet.18,19 If the weather is cloudy (denoted C = 1), then there is a low probability that the sprinkler was recently on (S = 1) and a high probability that it recently rained (R = 1). If the weather is not cloudy, then there is a moderate probability that the sprinkler was recently on and a low probability that it recently rained. Finally, if the sprinkler was recently on, rain fell, or both, then there is a high probability that the grass is wet (W = 1).

You walk outside and discover that the grass is wet and that the sprinkler is on. What, if anything, can you conclude about whether it rained recently? An intuitive pattern of reasoning, and one that seems to be exhibited by people, is to conclude that the water from the sprinkler wet the grass and, thus, there is no good evidence suggesting that it rained recently. This pattern is called explaining away because the observation that the sprinkler is on explains the fact that the grass is wet, meaning that there is no reason to hypothesize another cause, such as recent rain, for why the grass is wet. This type of causal reasoning naturally emerges from Bayes' rule. Without going into the mathematical details, a comparison of the probability of recent rain given that the grass is wet, p(R = 1 | W = 1), with the probability of recent rain given that the grass is wet and that the sprinkler is on, p(R = 1 | W = 1, S = 1), would show that the latter value is significantly smaller. Thus, Bayes' rule performs explaining away.

Explaining away illustrates an important advantage of Bayesian statistics, namely, that Bayesian models maintain and update probabilities for all possible values of their variables. Because probability distributions must sum or integrate to one, if some values become more likely, then other values must become less likely. This is true even for variables whose values are not directly specified in a data set. In the example above, the variables S and R are negatively correlated given the value of C. (If it is raining, it is unlikely that a sprinkler will be on. Similarly, if a sprinkler is on, it is unlikely to be raining.) Therefore, if a Bayesian observer discovers that the grass is wet and that the sprinkler is on, the observer can reasonably accept the hypothesis that the water from the sprinkler wet the grass. In doing so, the observer simultaneously rejects the hypothesis that it rained recently and the rain water wet the grass, despite the fact that the observer did not directly obtain any information about whether it rained. Thus, a Bayesian observer can simultaneously maintain and update probabilities for multiple competing hypotheses.
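The sprinkler example can be verified directly by brute-force enumeration. In the sketch below, the conditional probability tables are our own illustrative guesses chosen to match the qualitative description above; with them, p(R = 1 | W = 1) comes out substantially higher than p(R = 1 | W = 1, S = 1), which is explaining away.

```python
# Sketch: explaining away in the Figure 2 network, by enumerating the joint.
# CPT values are illustrative guesses, not the article's.
from itertools import product

p_c = 0.5                                  # p(C=1)
p_s_given_c = {0: 0.5, 1: 0.1}             # p(S=1 | C)
p_r_given_c = {0: 0.2, 1: 0.8}             # p(R=1 | C)
p_w_given_sr = {(0, 0): 0.0, (0, 1): 0.9,  # p(W=1 | S, R)
                (1, 0): 0.9, (1, 1): 0.99}

def joint(c, s, r, w):
    pc = p_c if c else 1 - p_c
    ps = p_s_given_c[c] if s else 1 - p_s_given_c[c]
    pr = p_r_given_c[c] if r else 1 - p_r_given_c[c]
    pw = p_w_given_sr[(s, r)] if w else 1 - p_w_given_sr[(s, r)]
    return pc * ps * pr * pw

def prob_rain(**evidence):
    """p(R=1 | evidence, W=1) by summing the joint over unobserved variables."""
    num = den = 0.0
    for c, s, r in product([0, 1], repeat=3):
        setting = {"C": c, "S": s, "R": r, "W": 1}
        if any(setting[k] != v for k, v in evidence.items()):
            continue
        p = joint(c, s, r, 1)
        den += p
        num += p * r
    return num / den

print(prob_rain(W=1))        # p(R=1 | W=1): fairly high
print(prob_rain(W=1, S=1))   # p(R=1 | W=1, S=1): lower -- explaining away
```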

Our scenario with sprinklers, rain, and wet grass provides a simple example of Bayesian inference, but cognitive scientists have studied more complicated examples. Often these examples involve prediction and, thus, inference and prediction are closely related.

Calvert et al.21 found that auditory cortex in normal-hearing individuals became activated when these individuals viewed facial movements associated with speech (e.g., lipreading) in the absence of auditory speech sounds. It is as if the individuals used their visual percepts to predict or infer what their auditory percepts would have been if auditory stimuli were present; that is, they computed p(auditory percept | visual percept).

Pirog Revill et al.22 used brain imaging to show that people predicted the semantic properties of a word while lexical competition was in process and before the word was fully recognized. For example, the speech input /can/ is consistent with the words can, candy, and candle and, thus, can, candy, and candle are active lexical competitors when /can/ is heard. These investigators found that a neural region typically associated with visual motion processing became more activated when motion words were heard than when nonmotion words were heard. Importantly, when nonmotion words were heard, the activation of this region was modulated by whether there was a lexical competitor that was a motion word rather than another nonmotion word. It is as if the individuals used the on-going speech input to predict the meaning of the current word as the speech unfolded over time; i.e., they computed p(semantic features | on-going speech percept).

Blakemore et al.23 used inference to explain why individuals cannot tickle themselves. A 'forward model' is a mental model that predicts the sensory consequences of a movement based on a motor command. These investigators hypothesized that when a movement is self-produced, its sensory consequences can be accurately predicted by a forward model; i.e., a person computes p(sensory consequences | motor movement), and this prediction is used to attenuate the sensory effects of the movement. In this way, the sensory effects of self-tickling are dulled.

Although there is now much experimental evidence for perceptual inference, the functional role of inference is not always obvious. For example, why is it important to predict an auditory percept based on a visual percept? In the section below titled Parameter Learning, we review the hypothesis that inference may play an important role in learning.

The examples above provide evidence that people infer the values of unobserved variables based on the values of observed variables. However, they do not show that this inference is quantitatively consistent with the use of Bayes' rule. Evidence suggesting that people's inferences are indeed quantitatively consistent with Bayes' rule comes from the study of sensory integration. The Bayesian network in Figure 3 characterizes a mental model of an observer who both sees and touches the objects in an environment. The nodes labeled 'scene variables' are the observer's internal variables for representing all conceivable three-dimensional scenes. As a matter of notation, let S denote the scene variables. Based on the values of the scene variables, haptic feature variables, denoted F_H, and visual feature variables, denoted F_V, are assigned values. For instance, a scene with a coffee mug gives rise to both haptic features, such as curvature and smoothness, and visual features, such as curvature and color. The haptic features influence the values of the haptic input variables, denoted I_H, when the observer touches the mug. Similarly, the visual features influence the values of the visual input variables, denoted I_V, when the observer views the mug.

FIGURE 3 | Bayesian network characterizing a domain in which an observer both sees and touches the objects in an environment. At the top of the hierarchy, the values of scene variables determine the probabilities of distal haptic and visual features. The distal haptic and visual features in turn determine the probabilities of values of proximal haptic and visual input (sensory) variables.

The values of the input variables are 'visible' because the observer directly obtains these values as percepts arising through touch and sight. However, the feature and scene variables are not directly observable and, thus, are regarded as hidden or latent. The distribution of the latent variables may be computed by the observer from the values of the visible variables using Bayesian inference. For instance, based on the values of the haptic and visual input variables, the observer may want to infer the properties of the scene. That is, the observer may want to compute p(S | I_H, I_V).

To illustrate how an observer might compute this distribution, we consider a specific instance of sensory integration. We suppose that an observer both sees and grasps a coffee mug, and wants to infer the depth of the mug (i.e., the distance from the front of the mug to its rear). Also we can suppose that the observer's belief about the depth of the mug given its visual input has a normal distribution, denoted N(µ_v, σ²_v). Similarly, the observer's belief about the depth given its haptic input has a normal distribution, denoted N(µ_h, σ²_h). Then, given certain mathematical assumptions, it is easy to show (as derived in many publications; e.g., Ref 24) that the belief about the depth, given both inputs, has a normal distribution whose mean is a linear combination of µ_v and µ_h:

µ_{v,h} = [σ_v^{−2} / (σ_v^{−2} + σ_h^{−2})] µ_v + [σ_h^{−2} / (σ_v^{−2} + σ_h^{−2})] µ_h   (4)

and whose variance is given by:

1/σ²_{v,h} = 1/σ²_v + 1/σ²_h   (5)

At an intuitive level, this form of sensory integration is appealing for several reasons. First, the variance of the depth distribution based on both inputs, σ²_{v,h}, is always less than the variances based on the individual inputs, σ²_v and σ²_h, meaning that depth estimates based on both inputs are more precise than estimates based on individual inputs. Second, this form of sensory integration uses information to the extent that this information is reliable or precise. This idea is illustrated in Figure 4. In the top panel, the distributions of depth given visual inputs and haptic inputs have equal variances (σ²_v = σ²_h). That is, the inputs are equally precise indicators of depth. The mean of the depth distribution based on both inputs is, therefore, an equally weighted average of the means based on the individual inputs. In the bottom panel, however, the variance of the distribution based on vision is smaller than the variance based on haptics (σ²_v < σ²_h), meaning that vision is a more precise indicator of depth. In this case, the mean of depth based on both inputs is also a weighted average of the means based on the individual inputs, but now the weight assigned to the visual mean is large and the weight assigned to the haptic mean is small.

FIGURE 4 | Bayesian model of sensory integration. (Top) A situation in which visual and haptic percepts are equally good indicators of depth. (Bottom) A situation in which the visual percept is a more reliable indicator of depth.
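Equations (4) and (5) translate into a few lines of code. In the sketch below (illustrative numbers only), vision is given twice the precision of touch, so the fused estimate lies closer to the visual mean and its variance is smaller than either unisensory variance.

```python
# Sketch: reliability-weighted cue fusion, Eqs (4) and (5). Numbers invented.

def fuse(mu_v, var_v, mu_h, var_h):
    """Combine visual and haptic depth estimates optimally (Eqs 4 and 5)."""
    w_v = (1 / var_v) / (1 / var_v + 1 / var_h)   # weight on vision
    w_h = (1 / var_h) / (1 / var_v + 1 / var_h)   # weight on touch
    mu = w_v * mu_v + w_h * mu_h                  # Eq. (4)
    var = 1 / (1 / var_v + 1 / var_h)             # Eq. (5)
    return mu, var

mu, var = fuse(mu_v=10.0, var_v=1.0, mu_h=12.0, var_h=2.0)
print(mu, var)  # fused mean is nearer the visual estimate; variance < both
```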

Do human observers perform sensory integration in a manner consistent with this statistically optimal and intuitively appealing framework? Several studies have now shown that, in many cases, the answer is yes. Ernst and Banks25 recorded subjects' judgments of the height of a block when they saw the block, when they grasped the block, and when they both saw and grasped the block. These investigators found that the framework accurately predicted subjects' multisensory judgments based on their unisensory judgments. This was true when visual signals were corrupted by small amounts of noise, in which case the visual signal was more reliable, and also when visual signals were corrupted by large amounts of noise, in which case the haptic signal was more reliable. Knill and Saunders26 used the framework to predict subjects' judgments of surface slant when surfaces were defined by visual stereo and texture cues based on their slant judgments when surfaces were defined by just the stereo cue or just the texture cue. It was found that the framework provided accurate predictions when surface slants were small, meaning that stereo was a more reliable cue, and when slants were large, meaning that texture was a more reliable cue. Other studies showing that people's multisensory percepts are consistent with the framework include Alais and Burr,27 Battaglia et al.,28 Ghahramani et al.,29 Jacobs,30 Kording and Wolpert,31 Landy et al.,32 Maloney and Landy,33 and Young et al.34

PARAMETER LEARNING

Parameter learning occurs when one of the conditional distributions inside a Bayesian network is adapted. For example, in the Bayesian network on sensory integration in Figure 3, let us suppose that an observer's probability distribution of visual inputs, given its internal representation of the visual features, is a normal distribution whose mean is equal to a weighted sum of the values of the visual feature variables. If we use Bayes' rule to update the posterior distribution of the weight values based on new data, then this would be an instance of parameter learning.

Do people perform parameter learning in the same manner as Bayesian networks? Various aspects of human learning are captured by Bayesian models. Perhaps the simplest example is a behavioral phenomenon known as 'backward blocking'.35 Suppose that in Stage 1 of an experiment, a person is repeatedly exposed to two cues, denoted C1 and C2, followed by an outcome O. The person will learn that each cue is at least partly predictive of the outcome. That is, the person will act as if there is a moderate association between C1 and O and also between C2 and O. In Stage 2, the person is repeatedly exposed to C1 followed by O (C2 does not appear in Stage 2). After Stage 2, the person will act as if there is a strong association between C1 and O. This is expected because the person has consistently seen C1 followed by O. However, the person will also act as if there is only a weak association between C2 and O. This is surprising because it suggests that the person has retrospectively revalued C2 by diminishing its associative strength despite the fact that C2 did not appear in Stage 2 of the experiment.

Backward blocking and explaining away (described earlier in the article) are similar phenomena. In both cases, an unobserved variable is revalued because an observed variable co-occurs with an outcome. Explaining away is regarded as a form of inference because it takes place on a short time scale—the unobserved variable is revalued after a single co-occurrence of an observed variable and an outcome. In contrast, backward blocking is regarded as a form of learning because it might take place on a relatively longer time scale of multiple co-occurrences of an observed variable and an outcome (although it can also happen in very few exposures). Ultimately, the distinction between inference, as in explaining away, and learning, as in backward blocking, evaporates: both inference and learning are modeled via Bayes' rule, but applied to variables with different meaning to the theorist.

It is challenging for non-Bayesian models to account for backward blocking. For example, according to the well-known Rescorla–Wagner model,36 a learner's knowledge consists of a single weight value, referred to as an associative strength, for each cue. Learning consists of changing a cue's weight value after observing a new occurrence of the cue and outcome. Because a cue's weight value changes only on trials in which the cue occurs, the learning rule cannot account for backward blocking, which seems to require a change in C2's weight value during Stage 2 despite the fact that C2 did not occur during Stage 2 (for references to non-Bayesian models that might address backward blocking, see Ref 37).

In contrast, Bayesian learning can account for backward blocking because a learner entertains simultaneously all possible combinations of weight values, with a degree of believability for each combination.37–40 That is, the learner maintains a posterior probability distribution over weight combinations, given the data experienced so far. In the context of the backward blocking experiment, this distribution is denoted p(w1, w2 | {data}). Bayes' rule can account for backward blocking as follows. After the first stage of training, in which the learner has seen cases of C1 and C2 together indicating the outcome, the learner has some degree of belief in a variety of weight combinations, including weight values of w1 = 0.5, w2 = 0.5 (both C1 and C2 are partially predictive of O), w1 = 1, w2 = 0 (C1 is fully predictive of O but C2 carries no information about O), and w1 = 0, w2 = 1 (C2 is fully predictive of O but C1 carries no information about O). In Stage 2, C1 alone is repeatedly paired with O and, thus, belief increases in both the combination w1 = 0.5, w2 = 0.5 and the combination w1 = 1, w2 = 0. Because total belief across all weight combinations must sum to one, belief in w1 = 0, w2 = 1 consequently drops. The learner, therefore, decreases the associative strength between C2 and O, thereby exhibiting backward blocking.
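The grid-based sketch below illustrates this account. The learner maintains a posterior over pairs (w1, w2); the particular likelihood used (outcome probability equal to w1·C1 + w2·C2, capped at one) is our own simple illustrative assumption, not a model from the literature. After Stage 2, the posterior mean of w2 drops even though C2 never appeared in that stage.

```python
# Sketch: backward blocking via a grid posterior over weight pairs (w1, w2).
# The likelihood form is an illustrative assumption.
import itertools

grid = [i / 10 for i in range(11)]                   # candidate weight values
posterior = {(w1, w2): 1.0 for w1, w2 in itertools.product(grid, grid)}

def update(posterior, c1, c2, outcome):
    """Multiply in the likelihood of one trial, then renormalize."""
    for (w1, w2), p in posterior.items():
        p_out = min(1.0, w1 * c1 + w2 * c2)          # p(O=1 | cues, weights)
        posterior[(w1, w2)] = p * (p_out if outcome else 1 - p_out)
    z = sum(posterior.values())
    for k in posterior:
        posterior[k] /= z

def mean_w2(posterior):
    return sum(w2 * p for (_, w2), p in posterior.items())

for _ in range(10):                # Stage 1: C1 and C2 together predict O
    update(posterior, 1, 1, 1)
print("after stage 1:", round(mean_w2(posterior), 3))
for _ in range(10):                # Stage 2: C1 alone predicts O
    update(posterior, 1, 0, 1)
print("after stage 2:", round(mean_w2(posterior), 3))   # belief in w2 drops
```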

Backward blocking is representative of only one type of learning situation, referred to as discriminative learning. In this situation, a learner's data consist of instances of cues and outcomes. The learner needs to accurately estimate the conditional probability distribution over outcomes given particular values of the cues. This conditional distribution is called a discriminative model, and it allows a learner to predict the likely outcomes based on the given cue values. Notice that the learner is not learning the distribution of cues. In contrast, in a generative learning situation, the learner learns the joint distribution of cues and outcomes. A generative model has latent variables to describe how underlying states generate the distribution of cues and outcomes simultaneously.41 In general, learning with latent variables is a computationally difficult problem. It is often approached by using inference to determine the probability distributions of the latent variables given the values of the observed variables (e.g., cues and outcomes). These distributions over latent variables are useful because they allow a learner to adapt its parameter values underlying conditional distributions of observed variables given values of latent variables.

Consider the problem of visual learning in multisensory environments. Recall that the Bayesian network in Figure 3 represents the mental model of an observer that both sees and touches objects in a scene. Also recall that the input variables (I_V and I_H) are 'visible' because the observer obtains the values of these variables when he or she touches and views objects. However, the feature (F_V and F_H) and scene variables (S) are not directly observable and, thus, are regarded as hidden or latent. Visual learning takes place when, for example, the observer adapts the parameter values underlying p(I_V | F_V), the conditional probability distribution associated with its visual input variables. To adapt these values, the observer needs information indicating how the values should be modified because F_V is an unobserved variable. How can the observer's visual system obtain this information?

Suppose that the observer touches and sees an object. The observer can infer the distribution of the visual feature variables F_V in two ways: she or he can calculate the conditional distribution of these variables given the values of both haptic and visual input variables, p(F_V | I_H, I_V), or given only the values of the visual input variables, p(F_V | I_V). The first distribution is based on more information and, thus, it can be used as a 'teaching signal'. That is, the observer can adapt its visual system so that p(F_V | I_V) is closer to p(F_V | I_H, I_V). This example illustrates the fact that multisensory environments are useful for visual learning because nonvisual percepts can be used by Bayes' rule to infer distributions that can be used by people when adapting their visual systems.42

The idea that inference provides important information used during parameter learning lies at the heart of the expectation-maximization (EM) algorithm, an optimization procedure for maximizing likelihood functions.43 Consider the Bayesian network shown in Figure 5. The nodes in the top row correspond to latent variables whose values are unobserved, whereas the nodes in the bottom row correspond to visible variables whose values are observed. Data sets consist of multiple data items where an item is a set of values of the visible variables. According to the network, data items are generated as follows. For each data item, the values of the latent variables are sampled from their prior distributions, and then the values of the visible variables are sampled from their conditional distributions given the values of the latent variables.

FIGURE 5 | Bayesian network characterizing a domain in which a large number of visible variables are dependent on a small number of unobserved latent variables.


During learning, a learner uses a data set to estimate the parameter values of the Bayesian network. Importantly, learning seems to require solving the following chicken-and-egg problem: in order to learn the parameter values of the conditional distributions, it seems necessary to know the values of the latent variables; but in order to infer the values of the latent variables, it seems necessary to know the parameter values of the conditional distributions. This chicken-and-egg problem is solved by the EM algorithm. The algorithm is an iterative algorithm with two steps at each iteration. During the E-step, it uses the values of the visible variables and the current estimates of the parameter values of the conditional distributions to compute the expected values of the latent variables. During the M-step, it uses the values of the visible variables and the expected values of the latent variables to re-estimate the parameter values of the conditional distributions as the values that maximize the likelihood of the data. (The reader should note that the EM algorithm is not a fully Bayesian algorithm because it finds point estimates of the parameter values as opposed to probability distributions over parameter values. Nonetheless, the algorithm has played an important role in the literature and, thus, we include it here.)

Specific instances of the use of the EM algorithm to learn the parameter values for the Bayesian network in Figure 5 are commonplace in the cognitive science literature.44 In principal component analysis, data typically exist in a high-dimensional space (i.e., there are many visible variables), and the learner accounts for the data by hypothesizing that data items are a linear projection from a lower-dimensional space (i.e., there are relatively few latent variables). Orthogonal axes of the lower-dimensional space are discovered by analyzing the data's directions of maximum variance in the high-dimensional space.a Factor analysis is similar to principal component analysis, though the learner does not necessarily seek to characterize the lower-dimensional space through orthogonal axes. Instead, the learner seeks to discover latent variables that account for correlations among the visible variables. Mixture models in which the mixture components are normal distributions also fit in this framework. Here, the number of latent variables equals the number of possible clusters or categories of data items. The learner assumes that each item was generated by a single latent variable indicating the item's cluster or category membership.
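As a concrete illustration of the E-step/M-step cycle, here is a compact EM sketch (our own, with invented data) for a two-component Gaussian mixture, the last of the models mentioned above.

```python
# Sketch: EM for a two-component Gaussian mixture. Data and starting
# values are invented for illustration.
import math, random

def em_mixture(data, iters=50):
    mu = [min(data), max(data)]          # crude initial means
    var = [1.0, 1.0]
    pi = [0.5, 0.5]                      # mixing proportions
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) /
                 math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
            z = sum(p)
            resp.append([pk / z for pk in p])
        # M-step: re-estimate parameters from the expected assignments
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)   # guard against collapse
            pi[k] = nk / len(data)
    return mu, var, pi

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(200)]
print(em_mixture(data))   # means should recover roughly 0 and 5
```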

STRUCTURE LEARNING

Parameter learning is not the only type of learning that can take place in a Bayesian network. Another type of learning that can occur is referred to as 'structure learning'. In the examples of Bayesian networks discussed above, the structure of a network was fixed, where the structure refers to a network's set of nodes (or set of random variables) and set of directed edges between nodes. Given certain assumptions, however, it is possible to learn a probability distribution over possible structures of a network from data.

Why are assumptions needed for structure learning? It is because the space of possible structures grows extremely fast. If there are n random variables, then the number of possible structures grows super-exponentially in n [see Ref 45; specifically, the rate of growth is O(n! 2^{n(n−1)/2}), where n(n−1)/2 = n!/(2!(n−2)!) is the number of possible edges]. Consequently, it is generally not computationally possible to search the full space of possible structures.
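For a sense of scale, the short script below (our own arithmetic, using the bound just quoted) tabulates n! · 2^{n(n−1)/2} for small n; by n = 8 the bound already exceeds 10^13.

```python
# Sketch: growth of the structure-count bound n! * 2^C(n,2) with n.
from math import comb, factorial

for n in range(1, 9):
    print(n, factorial(n) * 2 ** comb(n, 2))
```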

Because the full space of structures cannot be searched, it is common to define a small 'dictionary' of plausible structures. A researcher may decide that there are n plausible structures worth considering for a given domain. The researcher may then perform 'model comparison' by comparing the ability of each structure or model to provide an account of the data. Let {M1, . . . , Mn} denote a set of plausible models. Then the researcher computes the posterior probability of each model given the data using Bayes' rule:

p(M_i | {data}) ∝ p({data} | M_i) p(M_i)   (6)

The likelihood term p({data} | M_i) can be calculated by including the parameter values for a model, denoted θ, and then marginalizing over all possible parameter values:

p({data} | M_i) = ∫ p({data} | M_i, θ) p(θ | M_i) dθ   (7)

The distribution of the data is provided by a weighted average of each model's account of the data46:

p({data}) = Σ_{i=1}^{n} p({data} | M_i) p(M_i)   (8)
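The sketch below works through Eqs (6)–(8) for a deliberately simple case of our own devising: two hypothetical models of coin-flip data, one claiming the coin is fair and one with an unknown bias θ that is integrated out on a grid as a stand-in for the integral in Eq. (7).

```python
# Sketch: Eqs (6)-(8) for two hypothetical models of coin-flip data.
# M1: the coin is fair. M2: unknown bias theta with a uniform prior.

heads, flips = 9, 10

def binom_lik(theta, heads, flips):
    # binomial coefficient omitted: it cancels when comparing models
    return theta ** heads * (1 - theta) ** (flips - heads)

# Eq. (7): marginal likelihood of each model
lik_m1 = binom_lik(0.5, heads, flips)
grid = [(i + 0.5) / 100 for i in range(100)]          # grid over theta
lik_m2 = sum(binom_lik(t, heads, flips) for t in grid) / len(grid)

# Eq. (6): posterior over models, with equal priors p(M1) = p(M2) = 0.5
post_m1 = lik_m1 * 0.5 / (lik_m1 * 0.5 + lik_m2 * 0.5)
print("p(M1 | data) =", round(post_m1, 3))            # biased-coin model wins

# Eq. (8): model-averaged probability of the data
print("p(data) =", lik_m1 * 0.5 + lik_m2 * 0.5)
```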

This approach is referred to as 'model averaging'.

Kording et al.47 presented a model-averaging approach to sensory integration. Consider an environment containing both visual and auditory stimuli, and an observer that needs to locate events in space. A problem for the observer is to determine whether a visual stimulus and an auditory stimulus originate from the same environmental event, in which case the locations indicated by these stimuli should be integrated, or whether the visual and auditory stimuli originate from two different events, in which case the locations indicated by these stimuli should not be integrated. Kording et al.47 defined two plausible Bayesian networks, one corresponding to stimuli that originate from the same event and the other corresponding to stimuli that originate from different events. Using a weighted average of the predictions of these two networks, the overall system was able to quantitatively predict human subjects' responses in a variety of auditory–visual localization experiments.

As a second example, Griffiths et al.48 used model averaging to account for human category learning. They noted that most computational models in the literature are either instances of prototype theory, in which a single structure characterizes the statistics of each category, or exemplar theory, in which a different structure characterizes the statistics of every exemplar. Rather than choose between these two extremes, they used an approach (known as Dirichlet process mixture models) that can be implemented using a model that starts with one structure, but then adds additional structures as needed during the course of learning. The end result is that the data indicate a probability distribution over how many structures the model will have. For some data sets, the model may tend to use very few structures, whereas it may tend to use a large number of structures for other data sets. Thus, the model resembles neither a prototype theory nor an exemplar theory, but rather a hybrid that possesses many of the best features of both types of theories.

Because model averaging can be computationally expensive, some researchers have pursued a less expensive approach: they parameterize a family of plausible structures, and then search for the parameter values with the highest posterior probability. (The reader should note that this approach is not strictly Bayesian because it uses point estimates of the parameter values instead of using the full posterior probability distribution over parameter values.) This approach essentially turns the structure learning problem into a parameter learning problem. It is typically implemented through the use of hierarchical Bayesian networks in which distributions of parameters at the top of the hierarchy govern distributions of parameters at lower levels of the hierarchy.

Perhaps the most successful example of the use of this strategy comes from Kemp and Tenenbaum.49 These authors defined a hierarchy in which the top level determined the form of a model, where there were eight possible forms (not necessarily mutually exclusive): partitions, chains, orders, rings, hierarchies, trees, grids, and cylinders. The next level determined the structure of a particular form. For example, if a Bayesian network was to have a tree form, then this level determined the particular tree structure that the network would have. It was found that the system consistently discovered forms and structures that, intuitively, seem reasonable for a domain. For example, using a database of votes cast by Supreme Court justices, it was found that a chain structure with 'liberal' justices at one end and 'conservative' justices at the other end fit the data best. Similarly, based on the features of a large number of animals, it was found that a tree structure with categories and subcategories closely resembling the Linnaean taxonomy of species fit the data best.

Although structure learning is computationally expensive, it can be extremely powerful because it gives a system enormous representational flexibility. A system can use the data to determine a probability distribution over structures for representing that data. This representational flexibility is a key feature of Bayesian systems that is difficult to duplicate in other types of computational formalisms.

PRIOR KNOWLEDGE

Bayesian inference and learning begin with a model of observable variables (i.e., the likelihood function) and prior knowledge expressed as relative beliefs across different model structures and as the distribution of beliefs over parameter values within specific structures. The prior knowledge defines the space of all possible knowable representations, and the degree to which each representation is believed. The structure of the model of observable variables and the constraints imposed by prior knowledge strongly influence the inferences and learning that result from Bayes' rule.

Strong prior knowledge can facilitate large changes in belief from small amounts of data. For example, let us assume we have prior knowledge that a coin is a trick coin that either comes up heads almost all the time or comes up tails almost all the time. In other words, the prior belief is that intermediate biases, such as 30% heads, are not possible. We flip the coin once and find it comes up heads. From this single observation, we infer a strong posterior belief that the coin is the type that comes up heads almost always.
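This trick-coin inference takes only a few lines to verify. In the sketch below, the two bias values (95% heads versus 95% tails) and the equal prior are illustrative assumptions; a single observed head moves the posterior sharply toward the heads-biased hypothesis.

```python
# Sketch: the trick-coin example. Bias values and prior are illustrative.

p_heads = {"heads-biased": 0.95, "tails-biased": 0.05}
prior = {"heads-biased": 0.5, "tails-biased": 0.5}

# Bayes' rule after observing a single head
unnorm = {h: prior[h] * p_heads[h] for h in prior}
z = sum(unnorm.values())
posterior = {h: p / z for h, p in unnorm.items()}
print(posterior)   # {'heads-biased': 0.95, 'tails-biased': 0.05}
```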

The ability of people to learn from small amounts of data can be addressed in a Bayesian framework by strong constraints on the prior beliefs. Let us assume, for example, that we do not yet know the meaning of the word 'dog'. If we are shown a labeled example of a dog, what should we infer is the meaning of 'dog'? Children learn the extension of the word in only a few examples but, in principle, the word could refer to many sets of objects. Does the word refer to all furry things? To all things with a tail? To all four-legged things that are smaller than a pony but bigger than a mouse? A model of word learning proposed by Xu and Tenenbaum50 is able to learn word-extension mappings very quickly, in close correspondence to human learning, largely by virtue of strong constraints on prior beliefs regarding the space of possible extensions of words. The Bayesian learning model has a prior distribution that emphasizes hierarchically structured extensions (e.g., dalmatians within dogs within animals) based on perceptual similarity. It also includes a bias toward extensions that are perceptually distinctive. Because of the limited number of word meanings consistent with the prior beliefs, only a few examples are needed to sharply narrow the possible meaning of a word.

A powerful tool for Bayesian modeling is the use of hierarchical generative models to specify prior beliefs. The highest level of the prior specifies a generic theory of the domain to be learned, brought to bear by the learner. The theory generates all possible specific model structures for the domain, and the priors within structures.51,52 Learning simultaneously updates beliefs within and across model structures. The power of the approach is that instead of the hypothesis space being a heuristic intuited by the theorist, its assumptions are made explicit and attributed to the learner's theory regarding the domain. An example of a generative hierarchical prior was described earlier, for representing structural relations among objects.49 That model uses a generative grammar as a theory for constructing the space of all possible structural relations among objects. The generative grammar specifies iterative rules for relational graph construction, such as varieties of node splitting and edge construction. Each rule has a probability of application, thereby implicitly defining prior probabilities on the universe of all possible relational graphs. Another example comes from Goodman et al.,53 who used a probabilistic generative grammar to define a prior on the space of all possible disjunctive-normal-form concepts. The prior implicitly favors simpler concepts, because more complex concepts require application of additional generative steps, each of which can happen only with small probability. They showed that learning by the model captures many aspects of human concept learning.

Intuitively, it seems obvious that people bring prior knowledge to bear when learning new information. But instead of the theorist merely positing a prior and checking whether it can account for human learning, is there a way that the prior can be assayed more directly? One intriguing possibility is suggested by Kalish et al.54 They showed that as chains of people were successively taught an input–output relationship from the previous learner's examples, the noise in the transmission and learning processes quickly caused the learned relationship to devolve to the prior. In other words, because of the accumulating uncertainty introduced by each successive learner, the prior beliefs came to dominate the acquired representation.
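
This devolution to the prior can be illustrated with a small simulation of an iterated-learning chain. Everything here (the two hypotheses, the noise level, the starting hypothesis) is an assumption for illustration; each simulated learner updates a shared prior by Bayes' rule from one noisy example and then teaches the next learner:

```python
import random

PRIOR = {"A": 0.8, "B": 0.2}   # shared inductive bias of every learner
NOISE = 0.3                    # chance a teaching example is corrupted

def teach(hypothesis, rng):
    """Produce one (possibly corrupted) example labeled by the teacher's hypothesis."""
    return hypothesis if rng.random() > NOISE else ("B" if hypothesis == "A" else "A")

def learn(example):
    """Posterior over hypotheses given one noisy example."""
    lik = {h: (1 - NOISE) if h == example else NOISE for h in PRIOR}
    unnorm = {h: PRIOR[h] * lik[h] for h in PRIOR}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

rng = random.Random(1)
counts = {"A": 0, "B": 0}
hypothesis = "B"               # start the chain away from the prior's mode
for _ in range(10_000):
    post = learn(teach(hypothesis, rng))
    hypothesis = "A" if rng.random() < post["A"] else "B"
    counts[hypothesis] += 1

print(counts)  # frequencies approach the prior (roughly 8000 A, 2000 B)
```

Even though the chain starts at the less favored hypothesis, the long-run frequencies of hypotheses along the chain match the prior, which is the sense in which iterated learning 'assays' the learners' inductive bias.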

ACTIVE LEARNING: A NATURAL ROLE FOR BAYESIAN MODELS

Until this point in the article, we have emphasized the representational richness afforded by the Bayesian approach, with the only 'process' being the application of Bayes' rule in inference and learning. But another rich extension afforded by explicit representation of hypothesis spaces is models of active learning. In active learning, the learner can intervene upon the environment to select the next datum sampled. For example, experimental scientists select the next experiment they run to extract useful information from the world. This active probing of the environment is distinct from all the examples of learning mentioned previously in the article, which assumed the learner was a passive recipient of data chosen externally, like a sponge stuck to a rock, passively absorbing particles that happen to be sprayed over it by external forces.

One possible goal for active learning is maximal reduction in the uncertainty of beliefs. The learner should probe the environment in such a way that the information revealed is expected to shift beliefs to a relatively narrow range of hypotheses. Nelson55 described a variety of candidate goals for an active learner. The point is that all of these goals rely on there being a space of hypotheses with a distribution of beliefs. Many aspects of human active learning have been addressed by models of uncertainty reduction. For example, the seemingly illogical choices made by people in the Wason card selection task have been shown to be optimal under certain reasonable priors.56 The interventions people make to learn about causal networks can be modeled as optimal information gain.57 And active choice in associative learning tasks can be addressed with different combinations of models, priors, and goals.38
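
Here is a minimal sketch of query selection by expected information gain; the hypotheses, the two candidate queries, and their outcome likelihoods are invented for illustration:

```python
import math

belief = {"h1": 0.5, "h2": 0.3, "h3": 0.2}  # current beliefs over hypotheses

# P(outcome = "yes" | hypothesis) for each candidate query (assumed values).
queries = {
    "query_A": {"h1": 0.9, "h2": 0.1, "h3": 0.5},
    "query_B": {"h1": 0.6, "h2": 0.6, "h3": 0.4},
}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_information_gain(likelihood_yes):
    """Expected reduction in entropy from asking the query, averaged over outcomes."""
    prior_entropy = entropy(belief)
    eig = 0.0
    for outcome in ("yes", "no"):
        lik = {h: likelihood_yes[h] if outcome == "yes" else 1 - likelihood_yes[h]
               for h in belief}
        p_outcome = sum(belief[h] * lik[h] for h in belief)
        posterior = {h: belief[h] * lik[h] / p_outcome for h in belief}
        eig += p_outcome * (prior_entropy - entropy(posterior))
    return eig

for name, lik in queries.items():
    print(name, round(expected_information_gain(lik), 3))
# The learner should pose the query with the higher expected gain (here
# query_A, whose outcomes discriminate the hypotheses more sharply).
```

Other candidate goals from this literature (e.g., probability gain or impact) would simply replace the entropy-difference term with a different measure of how much an outcome improves beliefs.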

CONCLUSIONS

As discussed above, Bayesian models can use a wide variety of assumptions about the representation of prior beliefs, observations, and task goals. Bayesian modeling has been useful to cognitive scientists because it allows them to explore different sets of assumptions and their implications for rational behavior on a task. This helps cognitive scientists understand the nature of a task. When it is found that a Bayesian model closely mimics human cognition on a task, we have a useful explanation of how it is that complex cognition may be possible and why it works as it does, i.e., because it is normatively rational for that type of representational assumption and task.

We have emphasized three types of information processing operations in this article—inference, parameter learning, and structure learning—that result from applying Bayes' rule in different settings because these types of operations occur in both Bayesian models and human cognition. Does the fact that human behavior often seems to be suboptimal suggest that our emphasis on Bayesian operations is misplaced? We think that the answer is no, and there are at least two ways of justifying this answer.

First, some researchers have begun to claim that human behavior may appear to be suboptimal, but that this appearance is misleading. In fact, if cognitive limitations are taken into account, then human behavior is near-optimal.58 For example, Daw et al.59 argued that computational constraints may mean that people cannot use Bayes' rule to infer distributions over unobserved variables due to the complexity of the required calculations and, thus, people need to resort to approximate inference methods. Among the infinite variety of possible approximations, people use the 'best' approximations within some class of tractable approximations, where best can be defined in various reasonable ways. The use of these best approximations may help explain suboptimal behaviors. This type of approach to modeling suboptimal behaviors is currently in its infancy, though recent models show promise.60–63
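
As a concrete, if simplified, illustration of what 'approximate inference' can mean, the sketch below replaces an exact Bayesian calculation with a small number of samples drawn from the prior and weighted by likelihood (importance sampling); all hypotheses and probabilities are our illustrative assumptions:

```python
import random
from collections import Counter

random.seed(2)
prior = {"h1": 0.6, "h2": 0.3, "h3": 0.1}
likelihood = {"h1": 0.2, "h2": 0.7, "h3": 0.9}   # P(observed datum | h), assumed

# Exact posterior by Bayes' rule.
unnorm = {h: prior[h] * likelihood[h] for h in prior}
z = sum(unnorm.values())
exact = {h: p / z for h, p in unnorm.items()}

def approx_posterior(n_samples):
    """Importance-sampling approximation: draw hypotheses from the prior,
    weight each draw by its likelihood, and normalize the weights."""
    draws = random.choices(list(prior), weights=list(prior.values()), k=n_samples)
    weights = Counter()
    for h in draws:
        weights[h] += likelihood[h]
    total = sum(weights.values())
    return {h: weights[h] / total for h in prior}

print("exact:", {h: round(p, 3) for h, p in exact.items()})
for n in (5, 50, 5000):
    est = approx_posterior(n)
    print(f"n={n}:", {h: round(est.get(h, 0.0), 3) for h in prior})
# With few samples the approximation is noisy and can deviate sharply from
# the exact posterior; such resource limits are one candidate explanation
# for apparently suboptimal behavior.
```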

Second, a complementary approach to the challenge of modeling suboptimal behavior is to change the level of analysis. Although it may be natural to suppose that individual persons perform Bayesian inference and learning, it is also just as plausible, in principle, that higher and lower levels of organization carry out inference and learning. For example, single neurons can be modeled as carrying out Bayesian inference and learning.64 At the other extreme, it is not unreasonable to model corporations as inferential and learning entities, perhaps even using Bayesian formalisms. When the behavior being modeled is relatively simple, then the Bayesian computations required for the model are also relatively simple. For example, a Bayesian model of a single neuron might be tractable with a highly accurate approximation. The model remains Bayesian, with all its explanatory appeal. To capture the behavior of complex systems of neurons, what is needed is a way for these locally Bayesian agents to communicate with each other. Kruschke65 described one reasonable heuristic for hierarchies of locally Bayesian agents to interact. The framework has been applied to an intermediate level of analysis, in which functional components of learning are purely Bayesian, although the system as a whole may exhibit what appears to be non-Bayesian behavior because of limitations on communications between components. Sanborn and Silva62 described a way to reinterpret Kruschke's65 scheme in terms of approximate message passing in a globally Bayesian model. Thus, what may be approximately Bayesian at one level of analysis may be Bayesian at a different level of analysis.
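
The flavor of locally Bayesian components with limited communication can be conveyed with a toy construction (ours, not Kruschke's actual model): each agent applies Bayes' rule exactly, but the first can transmit only its single best guess, so information is lost relative to global Bayesian inference on the raw datum:

```python
PRIOR = {"state=0": 0.5, "state=1": 0.5}

def bayes(prior, likelihood):
    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

# Agent 1 sees a datum that is strongly diagnostic of state=1 (assumed values).
lik_datum = {"state=0": 0.1, "state=1": 0.9}
agent1_posterior = bayes(PRIOR, lik_datum)

# Limited-bandwidth message: only the most probable state is transmitted.
message = max(agent1_posterior, key=agent1_posterior.get)

# Agent 2 treats the one-bit message as a noisy observation; it assumes
# (our choice) that the message names the true state 70% of the time.
lik_message = {s: 0.7 if s == message else 0.3 for s in PRIOR}
agent2_posterior = bayes(PRIOR, lik_message)

print("posterior from the raw datum:   ", agent1_posterior)   # state=1: 0.9
print("posterior from the message only:", agent2_posterior)   # state=1: 0.7
# Each component applied Bayes' rule exactly, yet the pair as a whole lost
# the datum's strength of evidence in the one-bit message, so system-level
# behavior deviates from global Bayesian inference.
```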

In summary, Bayesian models of inference and learning provide a rich domain for formulating theories of cognition. Any representational scheme is allowed, so long as it specifies prior beliefs and observable data in terms of probability distributions. Then the engine of Bayes' rule derives predictions for perception, cognition, and learning in complex situations. Moreover, by maintaining explicit representations of beliefs over plausible hypotheses, the models also permit predictions for active learning and active information foraging. By analyzing the structures of tasks—including the relationships between model assumptions and optimal task performances—cognitive scientists better understand what people's minds are trying to accomplish and what assumptions people may be making when thinking and acting in the world.

NOTES

a The statements here are slightly misleading. In practice, principal components are not typically identified via the EM algorithm because practitioners do not consider principal component analysis from a probabilistic viewpoint. However, if principal component analysis is given a probabilistic interpretation,66,67 then the statements here are accurate.
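
For readers curious about the probabilistic route, the following sketch recovers a principal subspace with Roweis-style EM in the zero-noise limit; the data-generating process, dimensions, and iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
q = 2                                          # number of components sought

# Synthetic data with a clear 2-dimensional structure plus a little noise.
true_W = rng.standard_normal((5, q))
X = true_W @ rng.standard_normal((q, 500)) + 0.1 * rng.standard_normal((5, 500))
X -= X.mean(axis=1, keepdims=True)             # EM below assumes centered data

W = rng.standard_normal((5, q))                # random initial loading matrix
for _ in range(100):
    Z = np.linalg.solve(W.T @ W, W.T @ X)      # E-step: expected latent coordinates
    W = X @ Z.T @ np.linalg.inv(Z @ Z.T)       # M-step: re-estimate loadings

# Check: the learned columns span the same subspace as the top-q principal
# components obtained from an SVD of the data.
U = np.linalg.svd(X, full_matrices=False)[0][:, :q]
W_orth = np.linalg.qr(W)[0]
overlap = np.linalg.svd(U.T @ W_orth, compute_uv=False)
print(np.round(overlap, 4))                    # values near 1 => subspaces match
```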

ACKNOWLEDGEMENTS

We thank P. Dayan and C. Kemp for commenting on earlier versions of this manuscript. This work was supported by NSF research grant DRL-0817250.


REFERENCES

1. Cox RT. The Algebra of Probable Inference. Baltimore, MD: Johns Hopkins University Press; 1961.

2. Busemeyer JR, Diederich A. Cognitive Modeling. Los Angeles, CA: Sage Publications; 2010.

3. Scarborough D, Sternberg S. An Invitation to Cognitive Science: Methods, Models and Conceptual Issues, vol 4. Cambridge, MA: MIT Press; 1998.

4. Polk TA, Seifert CM. Cognitive Modeling. Cambridge, MA: MIT Press; 2002.

5. Sun R. The Cambridge Handbook of Computational Psychology. New York, NY: Cambridge University Press; 2008.

6. Marr D. Vision. New York, NY: Freeman; 1982.

7. Anderson JR. The Adaptive Character of Thought. Hillsdale, NJ: Lawrence Erlbaum Associates; 1990.

8. Barlow HB. Possible principles underlying the transformation of sensory messages. In: Rosenblith W, ed. Sensory Communication. Cambridge, MA: MIT Press; 1961, 217–234.

9. Chater N, Oaksford M. The Probabilistic Mind. Oxford, UK: Oxford University Press; 2008.

10. Geisler WS. Ideal observer analysis. In: Chalupa LM, Werner JS, eds. The Visual Neurosciences. Cambridge, MA: MIT Press; 2004, 825–837.

11. Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York, NY: John Wiley & Sons; 1966.

12. Griffiths TL, Kemp C, Tenenbaum JB. Bayesian models of cognition. In: Sun R, ed. The Cambridge Handbook of Computational Psychology. New York, NY: Cambridge University Press; 2008, 59–100.

13. Kahneman D, Slovic P, Tversky A. Judgment Under Uncertainty: Heuristics and Biases. Cambridge, UK: Cambridge University Press; 1982.

14. Knill DC, Richards W. Perception as Bayesian Inference. Cambridge, UK: Cambridge University Press; 1996.

15. Oaksford M, Chater N. Rational Models of Cognition. Oxford, UK: Oxford University Press; 1999.

16. Todorov E. Optimality principles in sensorimotor control. Nat Neurosci 2004, 7:907–915.

17. Neapolitan RE. Learning Bayesian Networks. Upper Saddle River, NJ: Pearson Prentice Hall; 2004.

18. Pearl J. Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann; 1988.

19. Russell S, Norvig P. Artificial Intelligence: A Modern Approach. 2nd ed. Upper Saddle River, NJ: Prentice Hall; 2003.

20. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd ed. New York, NY: Chapman and Hall; 2003.

21. Calvert GA, Bullmore ET, Brammer MJ, Campbell R, Williams SCR, et al. Activation of auditory cortex during silent lipreading. Science 1997, 276:593–596.

22. Pirog Revill K, Aslin RN, Tanenhaus MK, Bavelier D. Neural correlates of partial lexical activation. Proc Natl Acad Sci USA 2008, 105:13110–13114.

23. Blakemore S-J, Wolpert D, Frith C. Why can't you tickle yourself? NeuroReport 2000, 11:11–15.

24. Yuille AL, Bulthoff HH. Bayesian theory and psychophysics. In: Knill D, Richards W, eds. Perception as Bayesian Inference. Cambridge, UK: Cambridge University Press; 1996, 123–161.

25. Ernst MO, Banks MS. Humans integrate visual and haptic information in a statistically optimal fashion. Nature 2002, 415:429–433.

26. Knill DC, Saunders J. Do humans optimally integrate stereo and texture information for judgments of surface slant? Vision Res 2003, 43:2539–2558.

27. Alais D, Burr D. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol 2004, 14:257–262.

28. Battaglia PW, Jacobs RA, Aslin RN. Bayesian integration of visual and auditory signals for spatial localization. J Opt Soc Am A 2003, 20:1391–1397.

29. Ghahramani Z, Wolpert DM, Jordan MI. Computational models of sensorimotor integration. In: Morasso PG, Sanguineti V, eds. Self-Organization, Computational Maps, and Motor Control. New York, NY: Elsevier Science; 1997.

30. Jacobs RA. Optimal integration of texture and motion cues to depth. Vision Res 1999, 39:3621–3629.

31. Kording KP, Wolpert DM. Bayesian integration in sensorimotor learning. Nature 2004, 427:244–247.

32. Landy MS, Maloney LT, Johnston EB, Young M. Measurement and modeling of depth cue combination: in defense of weak fusion. Vision Res 1995, 35:389–412.

33. Maloney LT, Landy MS. A statistical framework for robust fusion of depth information. Visual Communications and Image Processing IV, Proceedings of the SPIE 1199, 1989, 1154–1163.

34. Young MJ, Landy MS, Maloney LT. A perturbation analysis of depth perception from combinations of texture and motion cues. Vision Res 1993, 33:2685–2696.

35. Shanks DR. Forward and backward blocking in human contingency judgement. Q J Exp Psychol 1985, 37B:1–21.

36. Rescorla RA, Wagner AR. A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In: Black AH, Prokasy WF, eds. Classical Conditioning II: Current Research and Theory. New York, NY: Appleton-Century-Crofts; 1972.

37. Kruschke JK. Bayesian approaches to associative learning: from passive to active learning. Learn Behav 2008, 36:210–226.

38. Dayan P, Kakade S. Explaining away in weight space. In: Leen T, Dietterich T, Tresp V, eds. Advances in Neural Information Processing Systems, vol 13. Cambridge, MA: MIT Press; 2001, 451–457.

39. Sobel DM, Tenenbaum JB, Gopnik A. Children's causal inferences from indirect evidence: backwards blocking and Bayesian reasoning in preschoolers. Cogn Sci 2004, 28:303–333.

40. Tenenbaum JB, Griffiths TL. Theory-based causal inference. In: Becker S, Thrun S, Obermayer K, eds. Advances in Neural Information Processing Systems, vol 15. Cambridge, MA: MIT Press; 2003, 35–42.

41. Courville AC, Daw ND, Touretzky DS. Bayesian theories of conditioning in a changing world. Trends Cogn Sci 2006, 10:295–300.

42. Jacobs RA, Shams L. Visual learning in multisensory environments. Topics Cogn Sci 2009, 2:217–225.

43. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 1977, 39:1–38.

44. Roweis S, Ghahramani Z. A unifying review of linear Gaussian models. Neural Comput 1999, 11:305–345.

45. Robinson RW. Counting labeled acyclic digraphs. In: Harary F, ed. New Directions in the Theory of Graphs. New York, NY: Academic Press; 1973, 239–273.

46. MacKay DJC. Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press; 2003.

47. Kording KP, Beierholm U, Ma WJ, Quartz S, Tenenbaum JB, et al. Causal inference in multisensory perception. PLoS ONE 2007, 2:e943.

48. Griffiths TL, Sanborn AN, Canini KR, Navarro DJ. Categorization as nonparametric Bayesian density estimation. In: Oaksford M, Chater N, eds. The Probabilistic Mind: Prospects for Rational Models of Cognition. Oxford, UK: Oxford University Press; 2009, 303–328.

49. Kemp C, Tenenbaum JB. The discovery of structural form. Proc Natl Acad Sci USA 2008, 105:10687–10692.

50. Xu F, Tenenbaum JB. Word learning as Bayesian inference. Psychol Rev 2007, 114:245–272.

51. Tenenbaum JB, Griffiths TL, Kemp C. Theory-based Bayesian models of inductive learning and reasoning. Trends Cogn Sci 2006, 10:309–318.

52. Tenenbaum JB, Griffiths TL, Niyogi S. Intuitive theories as grammars for causal inference. In: Gopnik A, Schulz L, eds. Causal Learning: Psychology, Philosophy, and Computation. Oxford, UK: Oxford University Press; 2007, 301–322.

53. Goodman ND, Tenenbaum JB, Feldman J, Griffiths TL. A rational analysis of rule-based concept learning. Cogn Sci 2008, 32:108–154.

54. Kalish ML, Griffiths TL, Lewandowsky S. Iterated learning: intergenerational knowledge transmission reveals inductive biases. Psychon Bull Rev 2007, 14:288–294.

55. Nelson JD. Finding useful questions: on Bayesian diagnosticity, probability, impact, and information gain. Psychol Rev 2005, 112:979–999.

56. Oaksford M, Chater N. Optimal data selection: revision, review, and reevaluation. Psychon Bull Rev 2003, 10:289–318.

57. Steyvers M, Tenenbaum JB, Wagenmakers E-J, Blum B. Inferring causal networks from observations and interventions. Cogn Sci 2003, 27:453–489.

58. Gigerenzer G, Todd PM, The ABC Research Group. Simple Heuristics that Make Us Smart. New York, NY: Oxford University Press; 1999.

59. Daw ND, Courville AC, Dayan P. Semi-rational models: the case of trial order. In: Chater N, Oaksford M, eds. The Probabilistic Mind. Oxford, UK: Oxford University Press; 2008, 431–452.

60. Brown SD, Steyvers M. Detecting and predicting changes. Cogn Psychol 2009, 58:49–67.

61. Sanborn A, Griffiths TL, Navarro DJ. A more rational model of categorization. Proceedings of the 28th Annual Conference of the Cognitive Science Society. Vancouver, Canada, 2006.

62. Sanborn A, Silva R. A machine learning perspective on the locally Bayesian model. Proceedings of the 31st Annual Conference of the Cognitive Science Society. Amsterdam, The Netherlands, 2009.

63. Shi L, Feldman NH, Griffiths TL. Performing Bayesian inference with exemplar models. Proceedings of the 30th Annual Conference of the Cognitive Science Society, 2008.

64. Deneve S. Bayesian spiking neurons II: learning. Neural Comput 2008, 20:118–145.

65. Kruschke JK. Locally Bayesian learning with applications to retrospective revaluation and highlighting. Psychol Rev 2006, 113:677–699.

66. Roweis S. EM algorithms for PCA and SPCA. In: Jordan MI, Kearns MJ, Solla SA, eds. Advances in Neural Information Processing Systems 10. Cambridge, MA: MIT Press; 1998.

67. Tipping ME, Bishop CM. Probabilistic principal component analysis. J R Stat Soc B 1999, 61:611–622.
