Psychonomic Bulletin & Review
2005, 12 (4), 573-604
Copyright 2005 Psychonomic Society, Inc.

THEORETICAL AND REVIEW ARTICLES

An introduction to Bayesian hierarchical models with an application in the theory of signal detection

JEFFREY N. ROUDER
University of Missouri, Columbia, Missouri

and

JUN LU
American University, Washington, D.C.

Although many nonlinear models of cognition have been proposed in the past 50 years, there has been little consideration of corresponding statistical techniques for their analysis. In analyses with nonlinear models, unmodeled variability from the selection of items or participants may lead to asymptotically biased estimation. This asymptotic bias, in turn, renders inference problematic. We show, for example, that a signal detection analysis of recognition memory data leads to asymptotic underestimation of sensitivity. To eliminate asymptotic bias, we advocate hierarchical models in which participant variability, item variability, and measurement error are modeled simultaneously. By accounting for multiple sources of variability, hierarchical models yield consistent and accurate estimates of participant and item effects in recognition memory. This article is written in tutorial format; we provide an introduction to Bayesian statistics, hierarchical modeling, and Markov chain Monte Carlo computational techniques.

This research is supported by NSF Grant SES-0095919 to J.N.R., Dongchu Sun, and Paul Speckman. We thank Dongchu Sun and Paul Speckman for many intensive conversations, and Andrew Heathcote, Trisha Van Zandt, and Richard Morey for helpful comments on a previous draft. Correspondence relating to this article may be sent to J. N. Rouder, Department of Psychological Sciences, 210 McAlester Hall, University of Missouri, Columbia, MO 65211 (e-mail: [email protected]).

In experimental science, it is desirable to hold all factors constant except those intentionally manipulated. In psychology, however, this ideal is often not possible. Elements such as participants and items vary, in addition to the intended factors. For example, a researcher interested in the psychology of reading might manipulate the part of speech and observe reading times. In this case, there is unintended variability from the selection of both participants and items. In his classic article, "The Language-as-Fixed-Effect Fallacy: A Critique of Language Statistics in Psychological Research," H. H. Clark (1973) discussed how unintended variability from the simultaneous selection of participants and items leads to underestimation of confidence intervals and inflation of Type I error rates in conventional analysis. Type I error rate inflation, or an increased tendency to find a significant effect when none exists, is highly undesirable.

To demonstrate the problem, consider the question of whether nouns and verbs are read at the same rate. To answer this question, a researcher could randomly select suitable verbs and nouns and ask a number of participants to read them. Each participant produces a set of reading time scores for both nouns and verbs. A common approach is to tabulate for each participant one mean reading time for nouns and another for verbs. To test the hypothesis of the equality of reading rates, these pairs of mean reading times may be submitted to paired t tests. This analytic approach is often used in memory research. For example, Riefer and Rouder (1992) used this analysis to determine whether bizarre sentences are better remembered than common ones. Clark (1973), however, argued that using t tests to analyze means tabulated across different items leads to Type I error rate inflation.

In the following demonstration, we show by simulation that this inflation is not only real, but also surprisingly large. We generate data for a standard ANOVA-style model (discussed below) with no part-of-speech effects. We analyze these data by first computing participant means for each part of speech and then submitting these means to a paired t test. This process is performed repeatedly, and the proportion of significant results is reported. If the test has no Type I error inflation, the proportion should be the nominal Type I error rate, which is set to the conventional value of .05.

Consider the following ANOVA-style model for nouns. It is reasonable to expect that each participant has a unique effect on reading time; some participants are fast at reading, but others are slow. This effect for the ith participant


is denoted α_i. Likewise, it is reasonable to expect that each item has a unique effect on reading time; some items are read quickly, and others are read slowly. This effect for the jth item is denoted β_j. Reading times reflect both the participant and item effects, as well as noise:

  N_ij = μ_n + α_i + β_j + ε_ij^(n),  (1)

where N_ij is the ith participant's reading time on the jth noun, μ_n is a grand reading time for nouns, α_i and β_j are participant and item effects, respectively, and ε_ij^(n) is any additional noise. Random variables ε_ij^(n) are independent normals centered around 0 with equal variances. Equation 1 is a familiar additive form that underlies both ANOVA and regression. For this paradigm, participant and item effects should be treated as random. It is reasonable to model them as independent random draws from normal distributions:

  α_i ~ind Normal(0, σ_1²),  (2)

  β_j ~ind Normal(0, σ_2²),  (3)

and

  ε_ij^(n) ~ind Normal(0, σ²).  (4)

In these equations, the symbol "~" is used to denote a distribution and may be read "is distributed as." When more than one random variable is assigned, as is the case above, and they are independent, "~ind" will be used to denote this relationship. The model for nouns is similar to conventional "within-subjects" or repeated measures models used in a standard ANOVA. The difference is that, whereas in conventional models only participants are treated as random effects, in the present model both participants and items are simultaneously treated as random effects.

An analogous model is placed on verbs:

  V_ij = μ_v + α_i + γ_j + ε_ij^(v),  (5)

where V_ij is the ith participant's reading time on the jth verb, μ_v is a grand reading time for verbs, α_i and γ_j are participant and item effects, respectively, and ε_ij^(v) is any additional noise. These random effects are modeled analogously:

  γ_j ~ind Normal(0, σ_2²)  (6)

and

  ε_ij^(v) ~ind Normal(0, σ²).  (7)

We simulated data from this model and performed the conventional analysis on aggregated means. Vector notation is helpful in describing the simulations. Let α denote the vector of all subject random effects, α = (α_1, α_2, …, α_I). (Boldface type is reserved for vectors and matrices.) The goal is to assess Type I error rates; consequently, data were simulated with no true difference in reading times between nouns and verbs (μ_n = μ_v). Each replicate of the simulation starts with simulating participant random effects (α), noun-item random effects (β), and verb-item random effects (γ) as draws from normal distributions. In our first simulation, there were 50 hypothetical participants, each observing 50 nouns and 50 verbs. Hence, there were 50 values of α and 50 values each of β and γ. These random effects were kept constant throughout a replicate. Next, the values of the noise, ε^(n) and ε^(v), were sampled; there was a total of 5,000 noise samples, one for each participant-by-item combination. Then scores N and V were computed by adding the grand means, random effects, and noise, in accordance with Equations 1 and 5. Mean scores for each participant in each part-of-speech condition were tabulated and submitted to a paired t test. There were 500 independent replicates per simulation.
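To make the procedure concrete, the following is a minimal sketch of this simulation in Python with NumPy/SciPy. It is our illustration, not the authors' code; the function name and seed are arbitrary, and the grand means are set to 0 since there is no part-of-speech effect.

```python
import numpy as np
from scipy import stats

def type1_error_rate(I=50, J=50, sigma1=1., sigma2=1., sigma=1.,
                     reps=500, alpha=.05, seed=1):
    """Proportion of significant paired t tests when mu_n = mu_v = 0."""
    rng = np.random.default_rng(seed)
    n_sig = 0
    for _ in range(reps):
        a = rng.normal(0., sigma1, I)              # participant effects (Eq. 2)
        b = rng.normal(0., sigma2, J)              # noun-item effects (Eq. 3)
        g = rng.normal(0., sigma2, J)              # verb-item effects (Eq. 6)
        N = a[:, None] + b[None, :] + rng.normal(0., sigma, (I, J))  # Eq. 1
        V = a[:, None] + g[None, :] + rng.normal(0., sigma, (I, J))  # Eq. 5
        # aggregate across items; paired t test on the participant means
        p = stats.ttest_rel(N.mean(axis=1), V.mean(axis=1)).pvalue
        n_sig += p < alpha
    return n_sig / reps

print(type1_error_rate(sigma2=0.))             # close to the nominal .05
print(type1_error_rate(sigma2=np.sqrt(1/3)))   # well above .05
```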

We performed several simulations by manipulating σ_1² and σ_2² (σ² was set to 1 and serves to scale the other parameter values). The main result is that the real Type I error rate is a function of item variability (σ_2²). Figure 1 shows the proportion of Type I errors (significant t test results) as a function of item variability. The filled circles are error rates for 50 hypothetical participants observing 50 nouns and 50 verbs; the circles with hatched lines are error rates for 20 hypothetical participants observing 20 nouns and 20 verbs. With no item variability, the Type I error rate equals the nominal value of .05. As item variability is increased, however, the Type I error rate increases, and does so dramatically. For example, for the simulation of the larger experiment, when item variability is only one third that of σ², the real Type I error rate is around .40. This is a surprisingly high rate.

The intuitive reason for the increased Type I error rate goes as follows. For each replicate, the aggregate item scores, mean(β) and mean(γ), vary. This variation affects all participants equally. In effect, this variation induces a correlation across participants. If a sampled set of items is a bit quicker than usual, all participants will be equally affected. This correlation violates the independent-observations assumption of the t test. It is not surprising, then, that there is an increase in Type I error rate.

The analysis above is termed participant analysis, since the data were aggregated across items to produce participant-specific scores (Baayen, Tweedie, & Schreuder, 2002). One alternative would be item analysis, in which data are aggregated across participants. A mean reading score is then tabulated for each item, and the mean scores are submitted to an appropriate t test for inference. Unfortunately, the Type I error rate of this t test is inflated by participant variability. Another alternative is to perform both item and participant analyses. Unfortunately, this alternative is also flawed, for if there is both item and participant variability, each of these tests has an inflated Type I error rate.

There are valid statistical procedures for this problem. Clark (1973) proposed a correction, a quasi-F statistic, that accounts for item variability. This correction works well (Forster & Dickinson, 1976; Raaijmakers, Schrijnemakers, & Gremmen, 1999). More recently, Baayen


et al. (2002) proposed a mixed linear model that accounts for both participant and item variation. Experimentalists, however, have a more intuitive approach: replication. The more a finding is replicated, the lower the chance that it is due to a Type I error. For example, consider two independent researchers who replicate each other's experiment at a nominal Type I error rate of .05. Assume that, due to unaccounted item variability, the actual Type I error rate for each replication is .2. If both replications are significant, the combined Type I error rate is .04, which is below the nominal criterion. Psychologists do not often use strict replication, but instead use near replications, in which there are only minor procedural differences across replicates. Consider the bizarre memory example above. Although Riefer and Rouder (1992) used aggregation to conclude that bizarre sentences are better recalled than common ones, the basic finding has been obtained repeatedly (see, e.g., Einstein, McDaniel, & Lackey, 1989; Hirshman, Whelley, & Palij, 1989; Pra Baldi, de Beni, Cornoldi, & Cavedon, 1985; Wollen & Cox, 1981), so it is surely not the result of a Type I error. Oft-replicated phenomena, such as the Stroop effect and semantic priming effects, are certainly not spurious.

The reason that replication is feasible in linear contexts (such as those underlying both t tests and ANOVA) is that population means can be estimated without bias even when there is unmodeled variability. For example, in our simulation of Type I error rates, estimates of true condition means were quite accurate and showed no apparent bias. Consequently, the true difference between two groups can be estimated without bias and may be obtained with greater accuracy by increasing sample size. In the case in which there is no true difference, increasing the sample size yields increasingly better estimates of the null group difference.

The situation is not nearly so sanguine for nonlinear models. Examples of nonlinear models include signal detection (Green & Swets, 1966), process dissociation (Egan, 1975; Jacoby, 1991), the diffusion model (Ratcliff, 1978; Ratcliff & Rouder, 1998), the fuzzy logical model of perception (Massaro & Oden, 1979), the hybrid deadline model (Rouder, 2000), the similarity choice model (Luce, 1963), the generalized context model (Medin & Schaffer, 1978; Nosofsky, 1986), and the interactive activation model (McClelland & Rumelhart, 1981). In fact, almost all models used for cognition and perception outside the ANOVA/regression framework are nonlinear. Nonlinear models are postulated because they are more realistic than linear models. Even though psychologists have readily adopted nonlinear models, they have been slow to acknowledge the effects of unmodeled variability in these contexts. These effects are not good: Unmodeled variability often yields distorted parameter estimates in many nonlinear models. For this reason, the assessment of true differences in these models is difficult.

EFFECTS OF VARIABILITY IN SIGNAL DETECTION

To demonstrate the effects of unmodeled variability on a nonlinear model, we performed a set of simulations in which the theory of signal detection (Green & Swets, 1966) was applied to a recognition memory paradigm. Consider the analysis of a recognition memory experiment that entails both randomly selected participants and items. In the signal detection model, participants monitor the familiarity of a target. If familiarity is above criterion, participants report the target as an old item; otherwise, they report it as a new item. The distribution of familiarity is assumed to be greater for old than for new items, and the degree of this difference is the sensitivity. Hits occur when old items are judged as old; false alarms occur when new items are judged as old. The model is shown in Figure 2. It is reasonable to expect that there is

Figure 1. The effect of unmodeled item variability (σ_2²) on Type I error rate when data are aggregated across items. All error rates were computed for a nominal Type I error rate of .05.


participant-level variability in sensitivity; some participants have better mnemonic abilities than others. It is reasonable also to expect item-level variability; some items are easier to remember than others.

To explore the effects of variability, we implemented a simulation similar to the previous one. Let d′_ij denote the ith individual's sensitivity to the jth item; likewise, let c_ij denote the ith individual's criterion when assessing the familiarity of the jth item (c_ij is measured as the distance from the mean of the new-item distribution; see Figure 2). Consider the following model:

  d′_ij = μ_d + α_i + β_j  (8)

and

  c_ij = μ_c + α_i/2 + β_j/2.  (9)

Parameters μ_d and μ_c are grand means, parameter α_i denotes the effect of the ith participant, and parameter β_j denotes the effect of the jth item. The model on d′ is additive for items and participants; in this sense, it is analogous to the previous model for reading times. One odd-looking feature of this simulation model is that participant and item effects are half as large on criteria as they are on sensitivity. This feature is motivated by Glanzer, Adams, Iverson, and Kim's (1993) mirror effect. The mirror effect refers to an often-observed pattern in which manipulations that increase sensitivity do so by both increasing hit rates and decreasing false alarms. Consider the case of unbiased responding shown in Figure 2. In the figure, the criterion is half of the value of the sensitivity. If a manipulation were to increase sensitivity by increasing the hit rate and decreasing the false alarm rate in equal increments, then criterion must increase half as much as the sensitivity gain (like sensitivity, criterion is measured from 0, the mean of the new-item distribution). In Equations 8 and 9, the particular effect of a participant or item is balanced in hit and false alarm rates. If a particular participant or item is associated with high sensitivity, the effect is equally strong in hit and false alarm rates.

Because α_i and β_j denote the effects from randomly selected participants and items, it is reasonable to model them as random effects. These are given by

  α_i ~ind Normal(0, σ_1²)  (10)

and

  β_j ~ind Normal(0, σ_2²).  (11)

We simulated data in a manner similar to the previous simulation. First, item and participant effects were sampled according to Equations 10 and 11. Then these random effects were combined according to Equations 8 and 9 in order to produce underlying sensitivities and criteria for each participant-item combination. On the basis of these true values, hypothetical responses were produced in accordance with the signal detection model shown in Figure 2. These hypothetical responses are dichotomous: A specific participant judges a specific item as either old or new. These responses were aggregated across items to produce hit and false alarm rates for each individual. From these aggregated rates, d′ and c were computed for each participant. Because individualized estimates of d′ were computed, variability across participants was not problematic (σ_1 was set equal to .5). The left panel of Figure 3 shows the results of a simulation with no item effects (σ_2 = 0). As expected, d′ estimates appear unbiased. The right panel shows a case with item variability (σ_2 = 1.5). The estimates systematically underestimate true values. This underestimation is an asymptotic bias; it does not diminish as the numbers of participants and items increase. We implemented variability in the context of a mirror effect, but the underestimation is also obtained in other implementations. Simulations with item variability exclusively in hits result in the same underestimation. Likewise, simulations with item variability in c alone result in the same underestimation (see Wickelgren, 1968, who makes a comparable argument).
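For readers who wish to see the attenuation directly, the following Python sketch mimics the simulation just described. It is our own illustration: it assumes separate pools of old and new items, uses hypothetical parameter values, and clips aggregate rates away from 0 and 1 so that the probits stay finite.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
I, J = 50, 100                          # participants; items per pool
mu_d, mu_c = 1.5, .75                   # hypothetical grand means
sigma1, sigma2 = .5, 1.5                # participant and item variability
a = rng.normal(0., sigma1, I)[:, None]
b_old = rng.normal(0., sigma2, J)[None, :]
b_new = rng.normal(0., sigma2, J)[None, :]
d = mu_d + a + b_old                    # sensitivities (Eq. 8)
c_old = mu_c + a/2 + b_old/2            # criteria, old items (Eq. 9)
c_new = mu_c + a/2 + b_new/2            # criteria, new items (Eq. 9)
hit = rng.random((I, J)) < 1 - norm.cdf(c_old - d)   # old judged old
fa = rng.random((I, J)) < 1 - norm.cdf(c_new)        # new judged old
h = np.clip(hit.mean(axis=1), .5/J, 1 - .5/J)
f = np.clip(fa.mean(axis=1), .5/J, 1 - .5/J)
d_hat = norm.ppf(h) - norm.ppf(f)       # per-participant d' from aggregates
print(d_hat.mean(), "vs. true grand mean", mu_d)     # systematically lower
```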

The presence of asymptotic bias is disconcerting. Unlike analysis with ANOVA, replication does not guarantee correct inference. In the case of signal detection, one cannot tell whether a difference in overall estimated sensitivity between two conditions is due to a true sensitivity difference or to an increase in unmodeled variability in one condition relative to the other. One domain in which asymptotic underestimation is particularly


Figure 2. The signal detection model. Hit and false alarm probabilities are the areas of the old-item and new-item familiarity distributions that are greater than the criterion, respectively. Values of sensitivity (d′) and criterion (c) are measured from 0, the mean of the new-item familiarity distribution.


pernicious is signal detection analyses of subliminal semantic activation. Greenwald, Draine, and Abrams (1996), for example, used signal detection sensitivity aggregated across items to claim that a set of word primes was subliminal (d′ = 0). These subliminal primes then affect response times to subsequent stimuli. The asymptotic downward bias from item variability, however, renders suspect the claim that the primes were truly subliminal.

The presence of asymptotic bias with unmodeled variability is not unique to signal detection. In general, unmodeled variability leads to asymptotic bias in nonlinear models. Psychologists have long known about problems associated with aggregation in specific contexts. For example, Estes (1956) warned about aggregating learning curves across participants. Aggregate learning curves may be more graded than those of individuals, leading researchers to misdiagnose underlying mechanisms (Haider & Frensch, 2002; Heathcote, Brown, & Mewhort, 2000). Curran and Hintzman (1995) critiqued aggregation in Jacoby's (1991) process dissociation procedure. They showed that aggregating responses over participants, items, or both possibly leads to asymptotic underestimation of automaticity estimates. Ashby, Maddox, and Lee (1994) showed that aggregating data across participants possibly distorts estimates from similarity choice-based scaling (such as those from Gilmore, Hersh, Caramazza, & Griffin, 1979). Each of the examples above is based on the same general problem: Unmodeled variability distorts estimates in a nonlinear context. In all of these cases, the distortion does not diminish with increasing sample sizes. Unfortunately, the field has been slow to grasp the general nature of this problem. The critiques above were made in isolation and appeared as specific critiques of specific models. Instead, we see them as different instances of the negative effects of aggregating data in analyses with nonlinear models.

The solution is to model both participant and item variability simultaneously. We advocate Bayesian hierarchical models for this purpose (Rouder, Lu, Speckman, Sun, & Jiang, 2005; Rouder, Sun, Speckman, Lu, & Zhou, 2003). In a hierarchical model, variability on several levels, such as from the selection of items and participants, as well as from measurement error, is modeled simultaneously. As was mentioned previously, hierarchical models have been used in linear contexts to improve power and to better control Type I error rates (see, e.g., Baayen et al., 2002; Kreft & de Leeuw, 1998). Although better statistical control is attractive to experimentalists, it is not critical. Instead, it is often easier (and wiser) to replicate results than to learn and implement complex statistical methodologies. For nonlinear models, however, hierarchical models are critical because bias is asymptotic.

The main drawback to nonlinear hierarchical models is tractability. They are difficult to implement. Starting in the 1980s and 1990s, however, new computational techniques based on Bayesian analysis emerged in statistics. The utility of these new techniques is immense; they allow for the analysis of previously intractable models, including nonlinear hierarchical ones (Gelfand & Smith, 1990; Gelman, Carlin, Stern, & Rubin, 2004; Gill, 2002; Tanner, 1996). These new Bayesian-based techniques have had tremendous impact in many quantitatively oriented disciplines, including statistics, engineering, bioinformatics, and economics.

The goal of this article is to provide an introduction to Bayesian hierarchical modeling. The focus is on a relatively new computational technique that has made


Figure 3. The effect of unmodeled item variability (σ_2) on the estimation of sensitivity (d′). True values of d′ are the means of d′_ij across items. The left and right panels, respectively, show the results without and with item variability (σ_2 = 1.5). For both plots, the value of σ_1 was .5.


Bayesian analysis more tractable: Markov chain Monte Carlo sampling. In the first section, basic issues of estimation are discussed. Then this discussion is expanded to include the basics of Bayesian estimation. Following that, Markov chain Monte Carlo integration is introduced. Finally, a series of hierarchical signal detection models is presented. These models provide unbiased and relatively efficient sensitivity estimates for the standard equal-variance signal detection model, even in the presence of both participant and item variability.

ESTIMATION

It is easier to discuss Bayesian estimation in the context of a simple example. Consider the case in which a single participant performs N trials. The result of each trial is either a success or a failure. We wish to estimate the underlying true probability of a success, denoted p, for this single participant. Estimating a probability of success is essential to many applications in cognitive psychology, including signal detection analysis. In signal detection, we typically estimate hit and false alarm rates. Hits are successes on old-item trials, and false alarms are failures on new-item trials.

A formula that provides an estimate is termed an estimator. A reasonable estimator for this case is

  p̂_0 = y/N,  (12)

where y is the number of successes out of N trials. Estimator p̂_0 is natural and straightforward.

To evaluate the usefulness of estimators, statisticians usually discuss three basic properties: bias, efficiency, and consistency. Bias and efficiency are illustrated in Table 1. The data are the results of weighing a hypothetical person of 170 lb on two hypothetical scales four separate times. Bias refers to the mean of repeated estimates. Scale A is unbiased because the mean of the estimates equals the true value of 170 lb. Scale B is biased: The mean is 172 lb, which is 2 lb greater than the true value of 170 lb. Scale B, however, has a smaller degree of error than does Scale A, so Scale B is termed more efficient than Scale A. Efficiency is the inverse of the expected error of an observation and may be indexed by the reciprocal of root mean squared error. Bias and efficiency have the same meaning for estimators as they do for scales. Bias refers to the difference between the average value of an estimator and a true value. Efficiency refers to the average magnitude of the difference between an estimator and a true value. In many situations, efficiency determines the quality of inference more than bias does.

How biased and efficient is estimator p̂_0? To provide a context for evaluation, consider the following two alternative estimators:

  p̂_1 = (y + .5)/(N + 1)  (13)

and

  p̂_2 = (y + 1)/(N + 2).  (14)

These two alternatives may seem unprincipled, but as is discussed in the next section, they are justified in a Bayesian framework.

Figure 4 shows sampling distributions for the three probability estimators when the true probability is p = .7 and there are N = 10 trials. Estimator p̂_0 is not biased, but estimators p̂_1 and p̂_2 are. Surprisingly, estimator p̂_2 has the lowest average error; that is, it is the most efficient. The reason is that the tails of the sampling distribution are closer to the true value of p = .7 for p̂_2 than for p̂_1 or p̂_0. Figure 4 shows the case for a single true value of p = .7. Figure 5 shows bias and efficiency for all three estimators for the full range of p. The conventional estimator p̂_0 is unbiased for all true values of p, but the other two estimators are biased for extreme probabilities. None of the estimators is always more efficient than the others. For intermediate probabilities, estimator p̂_2 is most efficient; for extreme probabilities, estimator p̂_0 is most efficient. Typically, researchers have some idea of what type of probability of success to expect in their experiments. This knowledge can therefore be used to help pick the best estimator for a particular situation.
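The bias and RMSE values plotted in Figures 4 and 5 are easy to approximate by simulation. A small Python check (ours; the seed and number of replicates are arbitrary) for p = .7 and N = 10:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 10, .7
y = rng.binomial(N, p, 100_000)          # many hypothetical experiments
for name, est in [("p0", y / N),
                  ("p1", (y + .5) / (N + 1)),
                  ("p2", (y + 1) / (N + 2))]:
    bias = est.mean() - p
    rmse = np.sqrt(np.mean((est - p) ** 2))
    print(name, round(bias, 3), round(rmse, 3))   # p2 has the smallest RMSE
```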

The other property of estimators is consistency. A consistent estimator converges to its true value: As the sample size is increased, the estimator not only becomes unbiased, but the overall variance shrinks toward 0. Consistency should be viewed as a necessary property of a good estimator. Fortunately, many estimators used in psychology are consistent, including sample means, sample variances, and sample correlation coefficients. The three estimators p̂_0, p̂_1, and p̂_2 are consistent; with sufficient data, they converge to the true value of p. In fact, all estimators presented in this article are consistent.

BAYESIAN ESTIMATION OF A PROBABILITY

In this section, we provide a Bayesian approach to estimating the probability of success. The datum in this application is the number of successes (y), and this quantity may be modeled as a binomial:

  y ~ Binomial(N, p),

where N is known and p is the unknown parameter of interest.


Table 1
Weight of a 170-Pound Person Measured on Two Hypothetical Scales

            Scale A               Scale B
Data (lb)   180, 160, 175, 165    174, 170, 173, 171
Mean        170                   172
Bias        0                     2.0
RMSE        7.91                  2.55
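The table's summary statistics follow directly from the definitions; a quick check in Python:

```python
import numpy as np

true_weight = 170.
scales = {"A": [180., 160., 175., 165.], "B": [174., 170., 173., 171.]}
for name, data in scales.items():
    w = np.array(data)
    bias = w.mean() - true_weight                    # mean error
    rmse = np.sqrt(np.mean((w - true_weight) ** 2))  # root mean squared error
    print(name, bias, round(rmse, 2))   # A: 0.0, 7.91; B: 2.0, 2.55
```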


Bayes's (1763) insightful contribution was to use the law of conditional probability to estimate the parameter p. He noted that

  f(p | y) = f(y | p) f(p) / f(y).  (15)

We describe each of these terms in order.

The main goal is to find f(p | y), the quantity on the left-hand side of Equation 15. The term f(p | y) is the distribution of the parameter p given the data y. This distribution is referred to as the posterior distribution. It is viewed as a function of the parameter p. The mean of the posterior distribution, termed the posterior mean, serves as a suitable point estimator for p.

The term f(y | p) plays two roles in statistics, depending on context. When it is viewed as a function of y for known p, it is known as the probability mass function. The probability mass function describes the probability of any outcome y given a particular value of p. For a binomial random variable, the probability mass function is given by

  f(y | p) = C(N, y) p^y (1 − p)^(N−y), y = 0, 1, …, N, and 0 otherwise,  (16)

where C(N, y) denotes the binomial coefficient.

For example, this function can be used to compute the probability of observing a total of two successes on three trials when p = .5:

  f(y = 2 | p = .5) = C(3, 2) (.5)² (1 − .5)¹ = 3/8.
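This computation can be checked with any statistics library; for example, in Python:

```python
from scipy.stats import binom

print(binom.pmf(2, 3, .5))   # 0.375, i.e., 3/8
```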

The term f(y | p) can also be viewed as a function of parameter p for fixed y. In this case, it is termed the likelihood function and describes the likelihood of a particular parameter value p given a fixed value of y. For the binomial, the likelihood function is given by

  f(y | p) = C(N, y) p^y (1 − p)^(N−y), 0 ≤ p ≤ 1.  (17)

In Bayes's theorem, we are interested in the posterior as a function of the parameter p. Hence, the right-hand terms of Equation 15 are viewed as a function of parameter p. Therefore, f(y | p) is viewed as the likelihood of p rather than the probability mass of y.

The term f(p) is the prior distribution and reflects the experimenter's a priori beliefs about the true value of p.

Finally, the term f(y) is the distribution of the data given the model. Although its interpretation is important in some contexts, it plays a minimal role in the


Figure 4. Sampling distributions of p̂_0, p̂_1, and p̂_2 for N = 10 trials with a p = .7 true probability of success on any one trial. Bias and root mean squared error (RMSE) are included.


development presented here, for the following reason. The posterior is a function of the parameter of interest p, but f(y) does not depend on p. The term f(y) is a constant with respect to p and serves as a normalizing constant that ensures that the posterior density integrates to 1. As will be shown in the examples, the value of this normalizing constant usually becomes apparent in analysis.

To perform Bayesian analysis, the researcher must choose a prior distribution. For this application of estimating a probability from binomially distributed data, the beta distribution serves as a suitable prior distribution for p. The beta is a flexible two-parameter distribution; see Figure 6 for examples. In contrast to the normal distribution, which has nonzero density on all numbers, the beta has nonzero density between 0 and 1. The beta distribution is a function of two parameters, a and b, that determine its shape. The notation for specifying a beta distribution prior for p is

  p ~ Beta(a, b).

The density of a beta random variable is given by

  f(p) = p^(a−1) (1 − p)^(b−1) / Be(a, b).  (18)

The denominator, Be(a, b), is termed the beta function.¹ Like f(y), it is not a function of p and plays the role of a normalizing constant in analysis.

In practice, researchers must completely specify the prior distribution before analysis; that is, they must choose suitable values for a and b. This choice reflects researchers' beliefs about the possible values of p before data are collected. When a = 1 and b = 1 (see the middle panel of Figure 6), the beta distribution is flat; that is, there is equal density for all values in the interval [0, 1]. By choosing a = b = 1, the researcher is committing to all values of p being equally likely before data are collected. Researchers can choose values of a and b to best match their beliefs about the expected data. For example, consider a psychophysical experiment in which the value of p will almost surely be greater than .5. For this experiment, priors with a > b are appropriate.

The goal is to derive the posterior distribution using Bayes's theorem (Equation 15). Substituting the likelihood function (Equation 17) and prior (Equation 18) into Bayes's theorem yields

  f(p | y) = [C(N, y) p^y (1 − p)^(N−y)] × [p^(a−1) (1 − p)^(b−1) / Be(a, b)] / f(y).

Collecting terms in p yields

  f(p | y) = k p^(y+a−1) (1 − p)^(N−y+b−1),  (19)

where

  k = C(N, y) / [Be(a, b) f(y)].

The term k is constant with respect to p and serves as a normalizing constant. Substituting

  a′ = y + a,  b′ = (N − y) + b  (20)

into Equation 19 yields

  f(p | y) = k p^(a′−1) (1 − p)^(b′−1).

This equation is proportional to a beta density for parameters a′ and b′. To make f(p | y) a proper probability density, k must have a value such that f(p | y) integrates to 1.


Figure 5. Bias and root mean squared error (RMSE) for the three estimators as functions of true probability of success. The solid, dashed, and dashed-dotted lines denote the characteristics of p̂_0, p̂_1, and p̂_2, respectively.

Figure 6. Probability density function of the beta distribution for various values of parameters a and b.


This occurs only if the value of k is a beta function, for example, k = 1/Be(a′, b′). Consequently,

  f(p | y) = p^(a′−1) (1 − p)^(b′−1) / Be(a′, b′).  (21)

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by separating those terms that depend on the parameter of interest from those that do not. The latter terms serve as a normalizing constant and can be computed by ensuring that the posterior integrates to 1.0. In most derivations, the value of the normalizing constant becomes apparent, much as it did above. Consequently, it is convenient to consider only those terms that depend on the parameter of interest and to lump the rest into a proportionality constant. Hence, a convenient form of Bayes's theorem is

  Posterior distribution ∝ Likelihood function × Prior distribution.

The symbol "∝" denotes proportionality and is read "is proportional to." This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1, and the observed data are 7 successes in 10 trials. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably more narrow than the prior, indicating that a large degree of information has been gained from the data. There are two valuable quantities derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.

For the case of a beta-distributed posterior, the expression for the posterior mean is

  p̄ = a′/(a′ + b′) = (y + a)/(N + a + b).

The posterior mean reduces to p̂_1 if a = b = .5; it reduces to p̂_2 if a = b = 1. Therefore, estimators p̂_1 and p̂_2 are theoretically justified from a Bayesian perspective. The estimator p̂_0 may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper. Conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper. In some cases, an improper prior will still yield a proper posterior. When this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.
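Because the posterior is a known beta, its summaries can be computed directly. A short Python illustration for the example of Figure 7 (the equal-tail interval below is a convenient stand-in for the highest density region):

```python
from scipy.stats import beta

a, b, y, N = 1, 1, 7, 10
a_post, b_post = a + y, b + (N - y)       # Equation 20: a' = 8, b' = 4
post = beta(a_post, b_post)
print(post.mean())                        # posterior mean = 8/12 ~ .667
print(post.ppf([.025, .975]))             # central 95% credible interval
```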

When the prior and posterior are from the same distribution, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data because the resulting posterior is also a beta distribution. Conjugate priors are convenient; they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion about conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model. We will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations w = (w_1, …, w_N) are independent and identically distributed as normals:

  w_j ~ind Normal(μ, σ²).

The goal is to estimate parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²

Assume that σ² is known and that the estimation of μ is of primary interest. The goal, then, is to derive the


Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


posterior f(μ | σ², w). According to the proportional form of Bayes's theorem,

  f(μ | σ², w) ∝ f(w | μ, σ²) f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

  μ ~ Normal(μ_0, σ_0²).

Parameters μ_0 and σ_0² must be specified before analysis according to the researcher's beliefs about μ. By choosing σ_0² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ.

The density of the prior is given by

  f(μ) = (2πσ_0²)^(−1/2) exp[−(μ − μ_0)²/(2σ_0²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

  f(w | μ, σ²) = (2πσ²)^(−N/2) exp[−Σ_j (w_j − μ)²/(2σ²)].  (22)

Multiplying the equation above by the prior yields

  f(μ | σ², w) ∝ exp[−Σ_j (w_j − μ)²/(2σ²)] × exp[−(μ − μ_0)²/(2σ_0²)].

We concern ourselves with only those terms that involve μ, the parameter of interest. Other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

  f(μ | σ², w) ∝ exp(−{μ²[N/σ² + 1/σ_0²] − 2μ[Σ_j w_j/σ² + μ_0/σ_0²]}/2) ∝ exp[−(μ − μ′)²/(2σ′²)],

where

  μ′ = σ′² (Σ_j w_j/σ² + μ_0/σ_0²)  (23)

and

  σ′² = (N/σ² + 1/σ_0²)^(−1).  (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ′², respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ′²:

  μ | σ², w ~ Normal(μ′, σ′²).  (25)

Unfortunately, the notation does not make it clear that both μ′ and σ′² depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ′²(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ′² = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior


is constructed to be fairly informative, with μ_0 = 100 and σ_0² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean, but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.
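The posterior in Figure 8 follows directly from Equations 23 and 24; a few lines of Python (our own check) reproduce the reported center and spread:

```python
import numpy as np

w = np.array([112., 106., 104., 111.])
sigma2 = 16.                        # variance assumed known
mu0, sigma02 = 100., 100.           # informative prior on mu
s2_post = 1. / (len(w) / sigma2 + 1. / sigma02)           # Equation 24
mu_post = s2_post * (w.sum() / sigma2 + mu0 / sigma02)    # Equation 23
print(round(mu_post, 2), round(s2_post, 2))               # 107.93 3.85
```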

Estimating σ² With Known μ

In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

  f(σ²) = [b^a/Γ(a)] (σ²)^(−(a+1)) exp(−b/σ²), σ² ≥ 0.

The parameters of the inverse gamma are a and b, and in application these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

  f(σ² | μ, w) ∝ (σ²)^(−N/2) exp[−Σ_j (w_j − μ)²/(2σ²)] × (σ²)^(−(a+1)) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

  f(σ² | μ, w) ∝ (σ²)^(−(a′+1)) exp(−b′/σ²),

where

  a′ = N/2 + a  (26)

and

  b′ = Σ_j (w_j − μ)²/2 + b.  (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

  σ² | μ, w ~ Inverse Gamma(a′, b′).  (28)

Note that posterior parameter b′ depends explicitly on the value of μ.
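Since the inverse gamma is the reciprocal of a gamma, the conditional posterior in Equation 28 is easy to sample. A sketch (ours; the value of μ and the prior settings are placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
w = np.array([112., 106., 104., 111.])
mu = 108.                          # treated as known for this conditional
a, b = 3., 100.                    # inverse-gamma prior parameters
a_post = len(w) / 2 + a                        # Equation 26
b_post = np.sum((w - mu) ** 2) / 2 + b         # Equation 27
# If sigma2 ~ Inverse Gamma(a', b'), then 1/sigma2 ~ Gamma(shape=a', rate=b').
sigma2 = 1. / rng.gamma(a_post, 1. / b_post, 10_000)
print(sigma2.mean())               # close to b' / (a' - 1)
```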

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem. The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors, μ | w and σ² | w. The straightforward


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


method of obtaining marginals is to integrate conditionals:

  f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis-Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis. With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

  x | η ~ Normal(μ + η, σ²)

and

  η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]_1 denotes the first sample, [η]_2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

  [x]_1 ~ Normal(μ + [η]_1, σ²)

and

  [x]_2 ~ Normal(μ + [η]_2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
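This digression is easy to reproduce. In the Python sketch below (our illustration; the parameter values are arbitrary), samples of η feed the conditional normal, and the resulting quantiles match those of the exact ex-Gaussian (SciPy's exponnorm, with shape K = τ/σ):

```python
import numpy as np
from scipy.stats import exponnorm

rng = np.random.default_rng(5)
mu, sigma, tau = 400., 40., 100.      # arbitrary ex-Gaussian parameters
eta = rng.exponential(tau, 50_000)    # samples [eta] from the exponential
x = rng.normal(mu + eta, sigma)       # samples [x] from the conditional normal
qs = [.1, .5, .9]
print(np.quantile(x, qs))                                     # Monte Carlo
print(exponnorm.ppf(qs, K=tau / sigma, loc=mu, scale=sigma))  # exact
```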

We now return to the problem of deriving the marginal posterior distribution μ | w from the conditional μ | σ², w. If there were a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there were a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]_1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

  [μ]_1 ~ Normal(μ′([σ²]_1), σ′²([σ²]_1)),

where μ′ and σ′² are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]_1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

  [μ]_m ~ Normal(μ′([σ²]_m), σ′²([σ²]_m))  (29)

and

  [σ²]_m ~ Inverse Gamma(a′, b′([μ]_{m−1})).  (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]_1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994). The initial region in


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m_0 denote the point at which the samples are approximately steady state. Samples after m_0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

  μ̄ = Σ_{m>m_0} [μ]_m / (M − m_0)

and

  σ̄² = Σ_{m>m_0} [σ²]_m / (M − m_0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m_0 to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution; Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.
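Putting the pieces together, a compact Gibbs sampler for this model can be written in a few lines of Python. This is our sketch of the algorithm in Equations 29 and 30, not the authors' code; the seed, chain length, and starting value are arbitrary.

```python
import numpy as np

def gibbs_normal(w, mu0=0., sigma02=1e6, a=1e-6, b=1e-6,
                 M=10_000, s2_init=1., seed=6):
    """Gibbs sampler for mu and sigma2 (Equations 23-30)."""
    rng = np.random.default_rng(seed)
    N, mu, sigma2 = len(w), np.empty(M), np.empty(M)
    s2 = s2_init                                  # [sigma2]_1, arbitrary
    for m in range(M):
        # sample mu given the current sigma2 (Equations 23-25)
        s2_post = 1. / (N / s2 + 1. / sigma02)
        mu_post = s2_post * (w.sum() / s2 + mu0 / sigma02)
        mu[m] = rng.normal(mu_post, np.sqrt(s2_post))
        # sample sigma2 given the current mu (Equations 26-28), via 1/Gamma
        b_post = np.sum((w - mu[m]) ** 2) / 2 + b
        s2 = 1. / rng.gamma(N / 2 + a, 1. / b_post)
        sigma2[m] = s2
    return mu, sigma2

w = np.array([112., 106., 104., 111.])
mu, sigma2 = gibbs_normal(w)
burn = 100
print(mu[burn:].mean())                          # ~108.25 with diffuse priors
print(np.percentile(mu[burn:], [2.5, 97.5]))     # ~[102.2, 114.4]
```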

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant, w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (s_w̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ_0 = 0, σ_0² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value for [σ²]_1. As can be seen, both chains converge to their steady state rather quickly. For this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous


Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]_1 = 10,000 and [σ²]_1 = .0001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


edge In the IQ example however much is known aboutthe distribution of IQ scores Suppose our test was normedfor a mean of 100 and a standard deviation of 15 Thisknowledge could be profitably used by setting μ0 100and σ 0

2 152 225 We may not know the particulartestndashretest correlation of our scales but we can surmisethat the standard deviations of replicates should be lessthan the population standard deviation After experiment-ing with the values of a and b we chose a 3 and b 100 The prior on σ 2 corresponding to these values isshown in Figure 11 It has much of its density below 100reflecting our belief that the testndashretest variability issmaller than the variability in the population Figure 11also shows a histogram of the posterior of μ (the posteriorwas obtained with Gibbs sampling as outlined above[σ 2]1 1 burn-in of 100 iterations) The posterior meanis 10795 which is somewhat below the sample mean of10825 This difference is from the prior people on aver-age score lower than the obtained scores Although it is abit hard to see in the figure the posterior distribution ismore narrow in the extreme tails than is the correspond-ing t distribution The 95 credible interval is [10211136] which is modestly different from the correspond-ing frequentist confidence interval This analysis shows

that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different but often more efficient than the frequentist counterpart.
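For concreteness, the following minimal R sketch (ours, not code from the article's appendices) implements this Gibbs sampler for the four IQ scores, using the standard conjugate conditional posteriors for μ and σ² under the informative priors just described:

w <- c(112, 106, 104, 111)
n <- length(w)
mu0 <- 100; sigma02 <- 225        # informative prior on mu
a <- 3; b <- 100                  # informative prior on sigma^2
M <- 10000
mu <- numeric(M); sigma2 <- numeric(M)
sigma2[1] <- 1                    # initial value [sigma^2]_1
for (m in 2:M) {
  # conditional posterior of mu given sigma^2 is normal
  prec <- n / sigma2[m - 1] + 1 / sigma02
  mu[m] <- rnorm(1, (sum(w) / sigma2[m - 1] + mu0 / sigma02) / prec,
                 sqrt(1 / prec))
  # conditional posterior of sigma^2 given mu is inverse gamma
  sigma2[m] <- 1 / rgamma(1, shape = a + n / 2,
                          rate = b + sum((w - mu[m])^2) / 2)
}
burn <- 100
mean(mu[(burn + 1):M])                       # posterior mean of mu
quantile(mu[(burn + 1):M], c(.025, .975))    # 95% credible interval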

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another; the value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly, and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that, for the normal application, there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
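In R, such an informal check amounts to a trace plot and a call to acf() on the post-burn-in samples (continuing the sketch above):

plot(mu[(burn + 1):M], type = "l")   # trace plot of the chain
acf(mu[(burn + 1):M])                # autocorrelation function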

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z) and, conversely, z = Φ⁻¹(p).

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ₀, σ₀²).  (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case in which a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta, with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) × exp[−(z − μ₀)² / (2σ₀²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined, upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ₀ = 0 and σ₀² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with parameters a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]₁, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.
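Although the details are deferred to Appendix A, the structure of the sampler is easy to sketch. The following minimal R version (our illustration, not the appendix code) alternates between sampling the truncated-normal latent variables and the normal conditional for z, for y = 7 successes in N = 10 trials with μ₀ = 0 and σ₀² = 1:

y <- 7; N <- 10
mu0 <- 0; sigma02 <- 1
M <- 10000
z <- numeric(M)                      # z[1] = 0 is the initial value
success <- rep(c(1, 0), c(y, N - y))
for (m in 2:M) {
  # latent variables: normals truncated above 0 (successes) or below 0
  # (failures), sampled by the inverse-CDF method
  p0 <- pnorm(0, z[m - 1], 1)
  u <- runif(N)
  w <- ifelse(success == 1,
              qnorm(p0 + u * (1 - p0), z[m - 1], 1),
              qnorm(u * p0, z[m - 1], 1))
  # conditional posterior of z given the latent variables is normal
  prec <- N + 1 / sigma02
  z[m] <- rnorm(1, (sum(w) + mu0 / sigma02) / prec, sqrt(1 / prec))
}
p <- pnorm(z[101:M])                 # posterior samples of p; compare Beta(8, 4)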


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.



ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i / N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that, although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success.


Figure 13. Three normally distributed priors on z and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The ith participant's true ability, z_i, is simply a sample from this normal. The number of successes is modeled as

y_i | z_i ~ind Binomial[N_i, Φ(z_i)],  (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ind Normal(μ, σ²).  (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ₀, σ₀²)  (34)

and

σ² ~ Inverse Gamma(a, b).  (35)

Values of μ₀, σ₀², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities,² which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂0 (Equation 12) and p̂2 (Equation 14) are also shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.
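The logic of that code can be conveyed in a brief R sketch (ours, not the Appendix C code): each cycle updates each z_i with the Albert–Chib step and then updates μ and σ² from their normal and inverse-gamma conditionals. The prior settings match those reported in the next paragraph.

y <- c(33, 28, 34, 29, 32, 37, 30, 29, 34, 34,
       34, 37, 38, 38, 37, 33, 37, 33, 38, 42)   # Table 2 data
I <- length(y); N <- 50
mu0 <- 0; sigma02 <- 1000; a <- 1; b <- 1
M <- 10000
z <- matrix(0, M, I); mu <- numeric(M); sigma2 <- rep(1, M)
for (m in 2:M) {
  for (i in 1:I) {
    # Albert-Chib step for participant i
    p0 <- pnorm(0, z[m - 1, i], 1)
    w.s <- qnorm(p0 + runif(y[i]) * (1 - p0), z[m - 1, i], 1)   # w > 0
    w.f <- qnorm(runif(N - y[i]) * p0, z[m - 1, i], 1)          # w <= 0
    prec <- N + 1 / sigma2[m - 1]
    z[m, i] <- rnorm(1, (sum(w.s) + sum(w.f) +
                         mu[m - 1] / sigma2[m - 1]) / prec, sqrt(1 / prec))
  }
  # parent (first-stage prior) parameters
  prec <- I / sigma2[m - 1] + 1 / sigma02
  mu[m] <- rnorm(1, (sum(z[m, ]) / sigma2[m - 1] + mu0 / sigma02) / prec,
                 sqrt(1 / prec))
  sigma2[m] <- 1 / rgamma(1, shape = a + I / 2,
                          rate = b + sum((z[m, ] - mu[m])^2) / 2)
}
p.hat <- colMeans(pnorm(z[101:M, ]))   # posterior means of each p_i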

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ₀ = 0, σ₀² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂0. The top right panel shows the hierarchical estimates as a function of p̂0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂0. For the data of Table 2, the difference in efficiency between p̂2 and p̂0 is exceedingly small (about .0001). The reason that there was little gain for p̂2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂2 and p̂0 have about the same efficiency. The hierarchical estimator is superior


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal trials and noise trials are, respectively, identical replicates.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms, h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ind Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ind Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ind Normal(μ_h, σ_h²)  (36)

and

f_i ~ind Normal(μ_f, σ_f²).  (37)

Second-stage priors are placed on the parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k),  k = h, f,  (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k),  k = h, f.  (39)

Table 2
Data and Estimates of a Probability of Success

Participant   True Prob.   Data   p̂0    p̂2     Hierarchical
 1            .587         33     .66   .654   .672
 2            .621         28     .56   .558   .615
 3            .633         34     .68   .673   .683
 4            .650         29     .58   .577   .627
 5            .659         32     .64   .635   .660
 6            .665         37     .74   .731   .716
 7            .667         30     .60   .596   .638
 8            .667         29     .58   .577   .627
 9            .687         34     .68   .673   .684
10            .693         34     .68   .673   .682
11            .696         34     .68   .673   .683
12            .701         37     .74   .731   .715
13            .713         38     .76   .750   .728
14            .734         38     .76   .750   .727
15            .736         37     .74   .731   .717
16            .738         33     .66   .654   .672
17            .743         37     .74   .731   .716
18            .759         33     .66   .654   .672
19            .786         38     .76   .750   .728
20            .803         42     .84   .827   .771
RMSE                              .053  .053   .041
Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples of h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished. For all samples m,

[d′_i]_m = [h_i]_m − [f_i]_m  (40)

and

[c_i]_m = −[f_i]_m.  (41)
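In code, this second step is a pair of elementwise operations on the stored chains; here h.chain and f.chain are hypothetical M × I matrices of posterior samples from the two runs:

dprime.chain <- h.chain - f.chain               # Equation 40, samplewise
c.chain <- -f.chain                             # Equation 41
dprime.hat <- colMeans(dprime.chain[101:M, ])   # posterior means after burn-in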

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.
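A minimal R sketch of this generation scheme follows; the log-normal and normal parameters are our illustrative assumptions, not the values used to produce Table 3:

I <- 20; S <- 50; N <- 50
dprime <- rlnorm(I, meanlog = 0.45, sdlog = 0.25)   # true d' values
crit <- rnorm(I, mean = dprime / 2, sd = 0.1)       # true criteria
p.hit <- pnorm(dprime - crit)   # p(h) = Phi(d' - c)
p.fa <- pnorm(-crit)            # p(f) = Phi(-c)
y.hit <- rbinom(I, S, p.hit)    # hit frequencies
y.fa <- rbinom(I, N, p.fa)      # false alarm frequencies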

A conventional sensitivity estimate for each individual was obtained using the formula

d̂′_i = Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).
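With the simulated frequencies from the sketch above, the conventional estimates are one line of R:

dprime.conv <- qnorm(y.hit / S) - qnorm(y.fa / N)   # conventional d' estimates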

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters a_h = a_f = b_h = b_f = .01, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information; the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17: The extreme estimates in the conventional method, which most likely



Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme, blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then, we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was "old" or "new." Let p_ij^(h) and p_ij^(f) be true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination, respectively. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) | h_ij ~ind Bernoulli[Φ(h_ij)]  (42)

and

y_ij^(f) | f_ij ~ind Bernoulli[Φ(f_ij)].  (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j  (44)

and

f_ij = μ_f + β_i + δ_j.  (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f  (46)

and

c = −μ_f.  (47)


Table 3
Data and Estimates of Sensitivity

Participant   True d′   Hits   False Alarms   Hierarchical   Conventional
 1            1.238     35      7             1.608          1.605
 2            1.239     38     17             1.307          1.119
 3            1.325     38     11             1.554          1.478
 4            1.330     33     16             1.146          0.880
 5            1.342     40     19             1.324          1.147
 6            1.405     39      8             1.730          1.767
 7            1.424     37     12             1.466          1.350
 8            1.460     27      5             1.417          1.382
 9            1.559     35      9             1.507          1.440
10            1.584     37      9             1.599          1.559
11            1.591     33      8             1.486          1.407
12            1.624     40      7             1.817          1.922
13            1.634     35     11             1.427          1.297
14            1.782     43     13             1.687          1.724
15            1.894     33      3             1.749          1.967
16            1.962     45     12             1.839          1.988
17            1.967     45      4             2.235          2.687
18            2.124     39      6             1.826          1.947
19            2.270     45      5             2.172          2.563
20            2.337     46      6             2.186          2.580
RMSE                                          .183           .275
Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i  (48)

and

c_i = −(μ_f + β_i).  (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j  (50)

and

c_j = −(μ_f + δ_j).  (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ²_0k, where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ²_0k = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²),  (52)

β_i ~ind Normal(0, σ_β²),  (53)

γ_j ~ind Normal(0, σ_γ²),  (54)

and

δ_j ~ind Normal(0, σ_δ²).  (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.


effect is not assumed in model analysis. Participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
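Thinning is a one-line operation on a stored chain. Continuing the earlier sketches, with burn the burn-in length and M the chain length, keeping every 50th sample looks as follows:

kept <- seq(burn + 1, M, by = 50)   # indices of retained samples
mu.thinned <- mu[kept]              # adjacent retained samples are nearly independent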

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | y, σ_α², σ_γ²) ∝ f(y | θ_h) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
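The Cholesky approach is compact in R: chol() returns an upper-triangular factor of the covariance matrix, so its transpose maps a vector of independent standard normals into a correlated draw. A minimal sketch:

rmvnorm.chol <- function(m, Sigma) {
  L <- t(chol(Sigma))   # lower-triangular factor with L %*% t(L) = Sigma
  as.vector(m + L %*% rnorm(length(m)))
}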

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement



on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant; the y-axis value of the point is α_i, the participant's hit rate effect, and the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21. These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ind Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ind Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
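For reference, base R can draw from a Wishart with rWishart(); the degrees of freedom and scale matrix below are illustrative values only. In the common conjugate setup, the Wishart is placed on the inverse (precision) matrix, so a covariance draw is the inverse of a Wishart draw:

W <- rWishart(1, df = 4, Sigma = diag(2))[, , 1]   # one 2 x 2 Wishart draw
Sigma.draw <- solve(W)                             # corresponding covariance matrix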

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is

γ

δj

j

⎝⎜

⎠⎟ ( )ind

~ Bivariate Normal 0 ( ) ΣΣ j

αβ

i

i

⎝⎜⎞

⎠⎟( )ind

~ Bivariate Normal 0 i ΣΣ( )


from the normal distribution: The posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior; increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent; the value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the


psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. L., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

$$x_j = \begin{cases} 1, & w_j \ge 0, \\ 0, & w_j < 0. \end{cases} \tag{A1}$$

Latent variables w are never known; all that is known is their sign. If a trial j was successful, w_j was nonnegative; otherwise, it was negative. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (see Note A1 below). Therefore,

$$w_j \stackrel{ind}{\sim} \mbox{Normal}(z, 1).$$

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ0 and σ0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

$$\mu' = \left(N + \frac{1}{\sigma_0^2}\right)^{-1} \left(N\bar{w} + \frac{\mu_0}{\sigma_0^2}\right) \tag{A2}$$

and

$$\sigma'^2 = \left(N + \frac{1}{\sigma_0^2}\right)^{-1}. \tag{A3}$$

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.


NOTES

1. The beta function is given by

$$\mbox{Be}(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)},$$

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 35 and b = 15.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful, and 0 otherwise. Latent variable w_ij is constructed as

$$x_{ij} = \begin{cases} 1, & w_{ij} \ge 0, \\ 0, & w_{ij} < 0. \end{cases} \tag{B1}$$

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ0 and σ² for σ0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w_j. Then, by construction,

$$\Pr(w_j < 0) = \Pr(x_j = 0) = 1 - p.$$

Random variable w_j is a normal with a variance of 1.0, so Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

$$\Pr(w_j < 0) = \Phi(0 - \mu_w) = \Phi(-\mu_w).$$

Combining this equality with the one above yields

$$\Phi(-\mu_w) = 1 - p,$$

and, by symmetry of the normal,

$$\mu_w = \Phi^{-1}(p) = z.$$


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-PLUS. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a Normal(b, 1) truncated below zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a Normal(b, 1) truncated above zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# generate true values and data
# -----------------------------
I = 20   # number of participants
J = 50   # number of trials per participant

# generate each individual's true p
p = rbeta(I, 35, 15)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251

# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# Analysis: setup
# ---------------
M = 10000         # total number of MCMC iterations
myrange = 101:M   # portion kept; burn-in of 100 iterations discarded

# needed vectors and matrices
z = matrix(ncol = I, nrow = M)   # each person's z at each iteration;
                                 # note z is the probit of p, i.e., z = qnorm(p)
sigma2 = 1:M
mu = 1:M

# prior parameters
sigma0 = 100   # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1

# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)

# shape for the inverse gamma, calculated outside of the loop for speed
shape = I / 2 + a

# --------------
# Main MCMC loop
# --------------
for (m in 2:M) {                        # iteration
  for (i in 1:I) {                      # participant
    # samples of w | y, z
    w = c(rtnpos(y[i], z[m - 1, i]), rtnneg(J - y[i], z[m - 1, i]))
    prec = J + 1 / sigma2[m - 1]
    meanz = (sum(w) + mu[m - 1] / sigma2[m - 1]) / prec
    z[m, i] = rnorm(1, meanz, sqrt(1 / prec))   # sample of z | w, mu, sigma2
  }
  prec = I / sigma2[m - 1] + 1 / sigma0
  meanmu = (sum(z[m, ]) / sigma2[m - 1] + mu0 / sigma0) / prec
  mu[m] = rnorm(1, meanmu, sqrt(1 / prec))      # sample of mu | z, sigma2
  rate = sum((z[m, ] - mu[m])^2) / 2 + b
  sigma2[m] = 1 / rgamma(1, shape = shape, scale = 1 / rate)  # sample of sigma2 | z, mu
}

# -----------------------------------------
# Gather estimates of p for each individual
# -----------------------------------------
allpest = pnorm(z[myrange, ])     # MCMC chains of p
pesth = apply(allpest, 2, mean)   # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



is denoted αi. Likewise, it is reasonable to expect that each item has a unique effect on reading time; some items are read quickly, and others are read slowly. This effect for the jth item is denoted βj. Reading times reflect both the participant and item effects, as well as noise:

$$N_{ij} = \mu_n + \alpha_i + \beta_j + \epsilon_{ij}^{(n)}, \tag{1}$$

where N_ij is the ith participant's reading time on the jth noun, μn is a grand reading time for nouns, αi and βj are participant and item effects, respectively, and ε_ij^(n) is any additional noise. Random variables ε_ij^(n) are independent normals centered around 0 with equal variances. Equation 1 is a familiar additive form that underlies both ANOVA and regression. For this paradigm, participant and item effects should be treated as random. It is reasonable to model them as independent random draws from normal distributions:

$$\alpha_i \stackrel{ind}{\sim} \mbox{Normal}\left(0, \sigma_1^2\right), \tag{2}$$

$$\beta_j \stackrel{ind}{\sim} \mbox{Normal}\left(0, \sigma_2^2\right), \tag{3}$$

and

$$\epsilon_{ij}^{(n)} \stackrel{ind}{\sim} \mbox{Normal}\left(0, \sigma^2\right). \tag{4}$$

In these equations, the symbol "~" is used to denote a distribution and may be read is distributed as. When more than one random variable is assigned, as is the case above, and they are independent, "ind~" will be used to denote this relationship. The model for nouns is similar to conventional "within-subjects" or repeated measures models used in a standard ANOVA. The difference is that, whereas in conventional models only participants are treated as random effects, in the present model both participants and items are simultaneously treated as random effects.

An analogous model is placed on verbs:

$$V_{ij} = \mu_v + \alpha_i + \gamma_j + \epsilon_{ij}^{(v)}, \tag{5}$$

where V_ij is the ith participant's reading time on the jth verb, μv is a grand reading time for verbs, αi and γj are participant and item effects, respectively, and ε_ij^(v) is any additional noise. These random effects are modeled analogously:

$$\gamma_j \stackrel{ind}{\sim} \mbox{Normal}\left(0, \sigma_2^2\right) \tag{6}$$

and

$$\epsilon_{ij}^{(v)} \stackrel{ind}{\sim} \mbox{Normal}\left(0, \sigma^2\right). \tag{7}$$

We simulated data from this model and performed the conventional analysis on aggregated means. Vector notation is helpful in describing the simulations. Let α denote the vector of all participant random effects, α = (α1, α2, ..., αI). (Boldface type is reserved for vectors and matrices.) The goal is to assess Type I error rates; consequently, data were simulated with no true difference in reading times between nouns and verbs (μn = μv). Each replicate of the simulation starts with simulating participant random effects (α), noun-item random effects (β), and verb-item random effects (γ) as draws from normal distributions. In our first simulation, there were 50 hypothetical participants, each observing 50 nouns and 50 verbs. Hence, there were 50 values of α and 50 values each of β and γ. These random effects were kept constant throughout a replicate. Next, the values of the noise, ε^(n) and ε^(v), were sampled. There was a total of 5,000 samples for each type of noise, one for each participant-by-item combination. Then scores N and V were computed by adding the grand means, random effects, and noise in accordance with Equations 1 and 5. Mean scores for each participant in each part-of-speech condition were tabulated and submitted to a paired t test. There were 500 independent replicates per simulation.

We performed several simulations by manipulating σ1² and σ2² (σ² was set to 1 and serves to scale the other parameter values). The main result is that the real Type I error rate is a function of item variability (σ2²). Figure 1 shows the proportion of Type I errors (significant t test results) as a function of item variability. The filled circles are error rates for 50 hypothetical participants observing 50 nouns and 50 verbs; the circles with hatched lines are error rates for 20 hypothetical participants observing 20 nouns and 20 verbs. With no item variability, the Type I error rate equals the nominal value of .05. As item variability is increased, however, the Type I error rate increases, and does so dramatically. For example, for the simulation of the larger experiment, when item variability is only one third that of σ², the real Type I error rate is around .40. This is a surprisingly high rate.

The intuitive reason for the increased Type I error rate goes as follows: For each replicate, the aggregate item scores, mean(β) and mean(γ), vary. This variation affects all participants equally. In effect, this variation induces a correlation across participants. If a sampled set of items is a bit quicker than usual, all participants will be equally affected. This correlation violates the independent-observations assumption of the t test. It is not surprising, then, that there is an increase in Type I error rate.

The analysis above is termed participant analysis, since the data were aggregated across items to produce participant-specific scores (Baayen, Tweedie, & Schreuder, 2002). One alternative would be item analysis, in which data are aggregated across participants. A mean reading score is then tabulated for each item, and the mean scores are submitted to an appropriate t test for inference. Unfortunately, the Type I error rate of this t test is inflated by participant variability. Another alternative is to perform both item and participant analyses. Unfortunately, this alternative is also flawed, for if there is both item and participant variability, each of these tests has an inflated Type I error rate.

There are valid statistical procedures for this problem. Clark (1973) proposed a correction, a quasi-F statistic, that accounts for item variability. This correction works well (Forster & Dickinson, 1976; Raaijmakers, Schrijnemakers, & Gremmen, 1999). More recently, Baayen



et al. (2002) proposed a mixed linear model that accounts for both participant and item variation. Experimentalists, however, have a more intuitive approach: replication. The more a finding is replicated, the lower the chance that it is due to a Type I error. For example, consider two independent researchers who replicate each other's experiment at a nominal Type I error rate of .05. Assume that, due to unaccounted item variability, the actual Type I error rate for each replication is .2. If both replications are significant, the combined Type I error rate is .04, which is below the nominal criterion. Psychologists do not often use strict replication, but instead use near replications in which there are only minor procedural differences across replicates. Consider the bizarre memory example above: Although Riefer and Rouder (1992) used aggregation to conclude that bizarre sentences are better recalled than common ones, the basic finding has been obtained repeatedly (see, e.g., Einstein, McDaniel, & Lackey, 1989; Hirshman, Whelley, & Palij, 1989; Pra Baldi, de Beni, Cornoldi, & Cavedon, 1985; Wollen & Cox, 1981), so it is surely not the result of a Type I error. Oft-replicated phenomena such as the Stroop effect and semantic priming effects are certainly not spurious.

The reason that replication is feasible in linear contexts (such as those underlying both t tests and ANOVA) is that population means can be estimated without bias even when there is unmodeled variability. For example, in our simulation of Type I error rates, estimates of true condition means were quite accurate and showed no apparent bias. Consequently, the true difference between two groups can be estimated without bias and may be obtained with greater accuracy by increasing sample size. In the case in which there is no true difference, increasing the sample size yields increasingly better estimates of the null group difference.

The situation is not nearly so sanguine for nonlinear models. Examples of nonlinear models include signal detection (Green & Swets, 1966), process dissociation (Egan, 1975; Jacoby, 1991), the diffusion model (Ratcliff, 1978; Ratcliff & Rouder, 1998), the fuzzy logical model of perception (Massaro & Oden, 1979), the hybrid deadline model (Rouder, 2000), the similarity choice model (Luce, 1963), the generalized context model (Medin & Schaffer, 1978; Nosofsky, 1986), and the interactive activation model (McClelland & Rumelhart, 1981). In fact, almost all models used for cognition and perception outside the ANOVA/regression framework are nonlinear. Nonlinear models are postulated because they are more realistic than linear models. Even though psychologists have readily adopted nonlinear models, they have been slow to acknowledge the effects of unmodeled variability in these contexts. These effects are not good: Unmodeled variability often yields distorted parameter estimates in many nonlinear models. For this reason, the assessment of true differences in these models is difficult.

EFFECTS OF VARIABILITY IN SIGNAL DETECTION

To demonstrate the effects of unmodeled variability on a nonlinear model, we performed a set of simulations in which the theory of signal detection (Green & Swets, 1966) was applied to a recognition memory paradigm. Consider the analysis of a recognition memory experiment that entails both randomly selected participants and items. In the signal detection model, participants monitor the familiarity of a target. If familiarity is above criterion, participants report the target as an old item; otherwise, they report it as a new item. The distribution of familiarity is assumed to be greater for old than for new items, and the degree of this difference is the sensitivity. Hits occur when old items are judged as old; false alarms occur when new items are judged as old. The model is shown in Figure 2.

Figure 1. The effect of unmodeled item variability (σ2²) on Type I error rate when data are aggregated across items. All error rates were computed for a nominal Type I error rate of .05.


It is reasonable to expect that there is participant-level variability in sensitivity; some participants have better mnemonic abilities than others. It is reasonable also to expect item-level variability; some items are easier to remember than others.

To explore the effects of variability, we implemented a simulation similar to the previous one. Let d′_ij denote the ith individual's sensitivity to the jth item; likewise, let c_ij denote the ith individual's criterion when assessing the familiarity of the jth item (c_ij is measured as the distance from the mean of the new-item distribution; see Figure 2). Consider the following model:

$$d'_{ij} = \mu_d + \alpha_i + \beta_j \tag{8}$$

and

$$c_{ij} = \mu_c + \frac{\alpha_i}{2} + \frac{\beta_j}{2}. \tag{9}$$

Parameters μd and μc are grand means, parameter αi denotes the effect of the ith participant, and parameter βj denotes the effect of the jth item. The model on d′ is additive for items and participants; in this sense, it is analogous to the previous model for reading times. One odd-looking feature of this simulation model is that participant and item effects are half as large on criteria as they are on sensitivity. This feature is motivated by Glanzer, Adams, Iverson, and Kim's (1993) mirror effect. The mirror effect refers to an often-observed pattern in which manipulations that increase sensitivity do so by both increasing hit rates and decreasing false alarms. Consider the case of unbiased responding shown in Figure 2. In the figure, the criterion is half of the value of the sensitivity. If a manipulation were to increase sensitivity by increasing the hit rate and decreasing the false alarm rate in equal increments, then criterion must increase half as much as the sensitivity gain (like sensitivity, criterion is measured from 0, the mean of the new-item distribution). In Equations 8 and 9, the particular effect of a participant or item is balanced in hit and false alarm rates: If a particular participant or item is associated with high sensitivity, the effect is equally strong in hit and false alarm rates.

Because αi and βj denote the effects from randomly selected participants and items, it is reasonable to model them as random effects. These are given by

$$\alpha_i \stackrel{ind}{\sim} \mbox{Normal}\left(0, \sigma_1^2\right) \tag{10}$$

and

$$\beta_j \stackrel{ind}{\sim} \mbox{Normal}\left(0, \sigma_2^2\right). \tag{11}$$

We simulated data in a manner similar to the previoussimulation First item and participant effects were sam-pled according to Equations 10 and 11 Then these ran-dom effects were combined according to Equations 8and 9 in order to produce underlying sensitivities and cri-teria for each participant item combination On the basisof these true values hypothetical responses were pro-duced in accordance with the signal detection modelshown in Figure 2 These hypothetical responses are di-chotomous A specific participant judges a specific itemas either old or new These responses were aggregatedacross items to produce hit and false alarm rates for eachindividual From these aggregated rates d prime and c werecomputed for each participant Because individualizedestimates of d prime were computed variability across partic-ipants was not problematic (σ1 was set equal to 5) Theleft panel of Figure 3 shows the results of a simulationwith no item effects (σ2 0) As expected d prime estimatesappear unbiased The right panel shows a case with itemvariability (σ2 15) The estimates systematicallyunderestimate true values This underestimation is an as-ymptotic bias it does not diminish as the numbers ofparticipants and items increase We implemented vari-ability in the context of a mirror effect but the underes-timation is also obtained in other implementations Sim-ulations with item variability exclusively in hits result inthe same underestimation Likewise simulations withitem variability in c alone result in the same underesti-mation (see Wickelgren 1968 who makes a comparableargument)
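A minimal R sketch of this simulation follows. The grand means and trial counts are illustrative assumptions; σ1 = .5 and σ2 = 1.5 follow the values reported above.

set.seed(2)
I = 50; J = 100                 # participants; items (illustrative counts)
mu.d = 2; mu.c = 1              # illustrative grand sensitivity and criterion
sig1 = 0.5; sig2 = 1.5          # participant and item standard deviations
alpha = rnorm(I, 0, sig1)
beta  = rnorm(J, 0, sig2)
d  = mu.d + outer(alpha, beta, "+")        # d'_ij (Equation 8)
cc = mu.c + outer(alpha, beta, "+") / 2    # c_ij (Equation 9)
ph = pnorm(d - cc)    # P(hit): old-item distribution has mean d', criterion c
pf = pnorm(-cc)       # P(false alarm): new-item distribution has mean 0
hit = matrix(rbinom(I * J, 1, ph), I, J)   # one old trial per participant-item pair
fa  = matrix(rbinom(I * J, 1, pf), I, J)   # one new trial per participant-item pair
hr = rowMeans(hit); fr = rowMeans(fa)      # aggregate over items
# guard against rates of exactly 0 or 1 before the probit transform
hr = pmin(pmax(hr, .5/J), 1 - .5/J)
fr = pmin(pmax(fr, .5/J), 1 - .5/J)
dhat = qnorm(hr) - qnorm(fr)               # conventional aggregated estimates
mean(dhat)   # systematically below the true mu.d = 2 when sig2 > 0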

The presence of asymptotic bias is disconcerting. Unlike analysis with ANOVA, replication does not guarantee correct inference. In the case of signal detection, one cannot tell whether a difference in overall estimated sensitivity between two conditions is due to a true sensitivity difference or to an increase in unmodeled variability in one condition relative to the other. One domain in which asymptotic underestimation is particularly pernicious is signal detection analyses of subliminal semantic activation.


Figure 2. The signal detection model. Hit and false alarm probabilities are the areas of the old-item and new-item familiarity distributions that are greater than the criterion, respectively. Values of sensitivity (d′) and criterion (c) are measured from 0, the mean of the new-item familiarity distribution.


Greenwald, Draine, and Abrams (1996), for example, used signal detection sensitivity aggregated across items to claim that a set of word primes was subliminal (d′ = 0). These subliminal primes then affect response times to subsequent stimuli. The asymptotic downward bias from item variability, however, renders suspect the claim that the primes were truly subliminal.

The presence of asymptotic bias with unmodeled variability is not unique to signal detection. In general, unmodeled variability leads to asymptotic bias in nonlinear models. Psychologists have long known about problems associated with aggregation in specific contexts. For example, Estes (1956) warned about aggregating learning curves across participants: Aggregate learning curves may be more graded than those of individuals, leading researchers to misdiagnose underlying mechanisms (Haider & Frensch, 2002; Heathcote, Brown, & Mewhort, 2000). Curran and Hintzman (1995) critiqued aggregation in Jacoby's (1991) process dissociation procedure: They showed that aggregating responses over participants, items, or both possibly leads to asymptotic underestimation of automaticity estimates. Ashby, Maddox, and Lee (1994) showed that aggregating data across participants possibly distorts estimates from similarity choice-based scaling (such as those from Gilmore, Hersh, Caramazza, & Griffin, 1979). Each of the examples above is based on the same general problem: Unmodeled variability distorts estimates in a nonlinear context. In all of these cases, the distortion does not diminish with increasing sample sizes. Unfortunately, the field has been slow to grasp the general nature of this problem. The critiques above were made in isolation and appeared as specific critiques of specific models. Instead, we see them as different instances of the negative effects of aggregating data in analyses with nonlinear models.

The solution is to model both participant and item variability simultaneously. We advocate Bayesian hierarchical models for this purpose (Rouder, Lu, Speckman, Sun, & Jiang, 2005; Rouder, Sun, Speckman, Lu, & Zhou, 2003). In a hierarchical model, variability on several levels, such as from the selection of items and participants as well as from measurement error, is modeled simultaneously. As was mentioned previously, hierarchical models have been used in linear contexts to improve power and to better control Type I error rates (see, e.g., Baayen et al., 2002; Kreft & de Leeuw, 1998). Although better statistical control is attractive to experimentalists, it is not critical; it is often easier (and wiser) to replicate results than to learn and implement complex statistical methodologies. For nonlinear models, however, hierarchical models are critical, because bias is asymptotic.

The main drawback to nonlinear hierarchical models is tractability: They are difficult to implement. Starting in the 1980s and 1990s, however, new computational techniques based on Bayesian analysis emerged in statistics. The utility of these new techniques is immense; they allow for the analysis of previously intractable models, including nonlinear hierarchical ones (Gelfand & Smith, 1990; Gelman, Carlin, Stern, & Rubin, 2004; Gill, 2002; Tanner, 1996). These new Bayesian-based techniques have had tremendous impact in many quantitatively oriented disciplines, including statistics, engineering, bioinformatics, and economics.

The goal of this article is to provide an introduction to Bayesian hierarchical modeling. The focus is on a relatively new computational technique that has made Bayesian analysis more tractable: Markov chain Monte Carlo sampling.


Figure 3. The effect of unmodeled item variability (σ2) on the estimation of sensitivity (d′). True values of d′ are the means of d′_ij across items. The left and right panels, respectively, show the results without and with item variability (σ2 = 1.5). For both plots, the value of σ1 was .5.


In the first section, basic issues of estimation are discussed. Then this discussion is expanded to include the basics of Bayesian estimation. Following that, Markov chain Monte Carlo integration is introduced. Finally, a series of hierarchical signal detection models is presented. These models provide unbiased and relatively efficient sensitivity estimates for the standard equal-variance signal detection model, even in the presence of both participant and item variability.

ESTIMATION

It is easier to discuss Bayesian estimation in the context of a simple example. Consider the case in which a single participant performs N trials. The result of each trial is either a success or a failure. We wish to estimate the underlying true probability of a success, denoted p, for this single participant. Estimating a probability of success is essential to many applications in cognitive psychology, including signal detection analysis. In signal detection, we typically estimate hit and false alarm rates: Hits are successes on old-item trials, and false alarms are failures on new-item trials.

A formula that provides an estimate is termed an estimator. A reasonable estimator for this case is

$$\hat{p}_0 = \frac{y}{N}, \tag{12}$$

where y is the number of successes out of N trials. Estimator p̂0 is natural and straightforward.

To evaluate the usefulness of estimators, statisticians usually discuss three basic properties: bias, efficiency, and consistency. Bias and efficiency are illustrated in Table 1. The data are the results of weighing a hypothetical person of 170 lb on two hypothetical scales four separate times. Bias refers to the mean of repeated estimates. Scale A is unbiased because the mean of the estimates equals the true value of 170 lb. Scale B is biased: The mean is 172 lb, which is 2 lb greater than the true value of 170 lb. Scale B, however, has a smaller degree of error than does Scale A, so Scale B is termed more efficient than Scale A. Efficiency is the inverse of the expected error of an observation and may be indexed by the reciprocal of root mean squared error. Bias and efficiency have the same meaning for estimators as they do for scales: Bias refers to the difference between the average value of an estimator and a true value.

How biased and efficient is estimator p0 To providea context for evaluation consider the following two al-ternative estimators

(13)

and

(14)

These two alternatives may seem unprincipled but as isdiscussed in the next section they are justif ied in aBayesian framework

Figure 4 shows sampling distributions for the threeprobability estimators when the true probability is p 7and there are N 10 trials Estimator p0 is not biasedbut estimators p1 and p2 are Surprisingly estimator p2has the lowest average error that is it is the most effi-cient The reason is that the tails of the sampling distri-bution are closer to the true value of p 7 for p2 thanfor p1 or p0 Figure 4 shows the case for a single truevalue of p 7 Figure 5 shows bias and efficiency forall three estimators for the full range of p The conven-tional estimator p0 is unbiased for all true values of pbut the other two estimators are biased for extreme prob-abilities None of the estimators is always more efficientthan the others For intermediate probabilities estimatorp2 is most efficient for extreme probabilities estimatorp0 is most efficient Typically researchers have someidea of what type of probability of success to expect intheir experiments This knowledge can therefore be usedto help pick the best estimator for a particular situation

The other property of estimators is consistency. A consistent estimator converges to its true value: As the sample size is increased, the estimator not only becomes unbiased, but the overall variance shrinks toward 0. Consistency should be viewed as a necessary property of a good estimator. Fortunately, many estimators used in psychology are consistent, including sample means, sample variances, and sample correlation coefficients. The three estimators p̂0, p̂1, and p̂2 are consistent; with sufficient data, they converge to the true value of p. In fact, all estimators presented in this article are consistent.

BAYESIAN ESTIMATION OF A PROBABILITY

In this section, we provide a Bayesian approach to estimating the probability of success. The datum in this application is the number of successes (y), and this quantity may be modeled as a binomial:

$$y \sim \mbox{Binomial}(N, p),$$

where N is known and p is the unknown parameter of interest.


Table 1
Weight of a 170-Pound Person Measured on Two Hypothetical Scales

             Scale A               Scale B
Data (lb)    180, 160, 175, 165    174, 170, 173, 171
Mean         170                   172
Bias         0                     2.0
RMSE         7.91                  2.55


Bayes's (1763) insightful contribution was to use the law of conditional probability to estimate the parameter p. He noted that

$$f(p \,|\, y) = \frac{f(y \,|\, p)\, f(p)}{f(y)}. \tag{15}$$

We describe each of these terms in order.

The main goal is to find f(p | y), the quantity on the left-hand side of Equation 15. The term f(p | y) is the distribution of the parameter p given the data y. This distribution is referred to as the posterior distribution and is viewed as a function of the parameter p. The mean of the posterior distribution, termed the posterior mean, serves as a suitable point estimator for p.

The term f(y | p) plays two roles in statistics, depending on context. When it is viewed as a function of y for known p, it is known as the probability mass function. The probability mass function describes the probability of any outcome y given a particular value of p. For a binomial random variable, the probability mass function is given by

$$f(y \,|\, p) = \begin{cases} \binom{N}{y} p^{y} (1-p)^{N-y}, & y = 0, \ldots, N, \\ 0, & \mbox{otherwise}. \end{cases} \tag{16}$$

For example, this function can be used to compute the probability of observing a total of two successes on three trials when p = .5:

$$f(y = 2 \,|\, p = .5) = \binom{3}{2} (.5)^2 (1 - .5)^1 = \frac{3}{8}.$$

The term f(y | p) can also be viewed as a function of parameter p for fixed y. In this case, it is termed the likelihood function and describes the likelihood of a particular parameter value p given a fixed value of y. For the binomial, the likelihood function is given by

$$f(y \,|\, p) = \binom{N}{y} p^{y} (1-p)^{N-y}, \quad 0 \le p \le 1. \tag{17}$$

In Bayes's theorem, we are interested in the posterior as a function of the parameter p. Hence, the right-hand terms of Equation 15 are viewed as a function of parameter p. Therefore, f(y | p) is viewed as the likelihood of p rather than the probability mass of y.

The term f(p) is the prior distribution and reflects the experimenter's a priori beliefs about the true value of p.

Finally, the term f(y) is the distribution of the data given the model. Although its interpretation is important in some contexts, it plays a minimal role in the development presented here, for the following reason.


Figure 4. Sampling distributions of p̂0, p̂1, and p̂2 for N = 10 trials with a p = .7 true probability of success on any one trial. Bias and root mean squared error (RMSE) are included.


The posterior is a function of the parameter of interest, p, but f(y) does not depend on p. The term f(y) is a constant with respect to p and serves as a normalizing constant that ensures that the posterior density integrates to 1. As will be shown in the examples, the value of this normalizing constant usually becomes apparent in analysis.

To perform Bayesian analysis, the researcher must choose a prior distribution. For this application of estimating a probability from binomially distributed data, the beta distribution serves as a suitable prior distribution for p. The beta is a flexible two-parameter distribution; see Figure 6 for examples. In contrast to the normal distribution, which has nonzero density on all numbers, the beta has nonzero density between 0 and 1. The beta distribution is a function of two parameters, a and b, that determine its shape. The notation for specifying a beta distribution prior for p is

$$p \sim \mbox{Beta}(a, b).$$

The density of a beta random variable is given by

$$f(p) = \frac{p^{a-1}(1-p)^{b-1}}{\mbox{Be}(a, b)}. \tag{18}$$

The denominator, Be(a, b), is termed the beta function.¹ Like f(y), it is not a function of p and plays the role of a normalizing constant in analysis.

In practice, researchers must completely specify the prior distribution before analysis; that is, they must choose suitable values for a and b. This choice reflects researchers' beliefs about the possible values of p before data are collected. When a = 1 and b = 1 (see the middle panel of Figure 6), the beta distribution is flat; that is, there is equal density for all values in the interval [0, 1]. By choosing a = b = 1, the researcher is committing to all values of p being equally likely before data are collected. Researchers can choose values of a and b to best match their beliefs about the expected data. For example, consider a psychophysical experiment in which the value of p will almost surely be greater than .5. For this experiment, priors with a > b are appropriate.

The goal is to derive the posterior distribution using Bayes's theorem (Equation 15). Substituting the likelihood function (Equation 17) and prior (Equation 18) into Bayes's theorem yields

$$f(p \,|\, y) = \frac{\binom{N}{y} p^{y}(1-p)^{N-y} \, p^{a-1}(1-p)^{b-1}}{\mbox{Be}(a, b)\, f(y)}.$$

Collecting terms in p yields

$$f(p \,|\, y) = k \, p^{y+a-1} (1-p)^{N-y+b-1}, \tag{19}$$

where

$$k = \binom{N}{y} \left[\mbox{Be}(a, b)\, f(y)\right]^{-1}.$$

The term k is constant with respect to p and serves as a normalizing constant. Substituting

$$a' = y + a, \qquad b' = N - y + b \tag{20}$$

into Equation 19 yields

$$f(p \,|\, y) = k \, p^{a'-1} (1-p)^{b'-1}.$$

This equation is proportional to a beta density with parameters a′ and b′. To make f(p | y) a proper probability density, k must have a value such that f(p | y) integrates to 1.


Figure 5. Bias and root mean squared error (RMSE) for the three estimators as functions of true probability of success. The solid, dashed, and dashed-dotted lines denote the characteristics of p̂0, p̂1, and p̂2, respectively.

Figure 6. Probability density function of the beta distribution for various values of parameters a and b.


This occurs only if the value of k is a beta function, that is, k = 1/Be(a′, b′). Consequently,

$$f(p \,|\, y) = \frac{p^{a'-1}(1-p)^{b'-1}}{\mbox{Be}(a', b')}. \tag{21}$$

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by separating those terms that depend on the parameter of interest from those that do not. The latter terms serve as a normalizing constant and can be computed by ensuring that the posterior integrates to 1.0. In most derivations, the value of the normalizing constant becomes apparent, much as it did above. Consequently, it is convenient to consider only those terms that depend on the parameter of interest and to lump the rest into a proportionality constant. Hence, a convenient form of Bayes's theorem is

Posterior distribution ∝ Likelihood function × Prior distribution.

The symbol "∝" denotes proportionality and is read is proportional to. This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1, and the observed data are 7 successes in 10 trials. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably more narrow than the prior, indicating that a large degree of information has been gained from the data. There are two valuable quantities derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.

For the case of a beta-distributed posterior, the expression for the posterior mean is

$$\frac{a'}{a' + b'} = \frac{y + a}{N + a + b}.$$

The posterior mean reduces to p̂1 if a = b = .5; it reduces to p̂2 if a = b = 1. Therefore, estimators p̂1 and p̂2 are theoretically justified from a Bayesian perspective. The estimator p̂0 may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper; conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper: In some cases, an improper prior will still yield a proper posterior, and when this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.

When the prior and posterior are from the same distribution, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data because the resulting posterior is also a beta distribution. Conjugate priors are convenient; they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion about conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model: We will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations w = (w1, ..., wN) are independent and identically distributed as normals:

$$w_j \stackrel{ind}{\sim} \mbox{Normal}\left(\mu, \sigma^2\right).$$

The goal is to estimate parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²
Assume that σ² is known and that the estimation of μ is of primary interest. The goal, then, is to derive the posterior f(μ | σ², w).


Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


According to the proportional form of Bayes's theorem,

$$f(\mu \,|\, \sigma^2, \mathbf{w}) \propto f(\mathbf{w} \,|\, \mu, \sigma^2)\, f(\mu \,|\, \sigma^2).$$

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

$$\mu \sim \mbox{Normal}\left(\mu_0, \sigma_0^2\right).$$

Parameters μ0 and σ0² must be specified before analysis, according to the researcher's beliefs about μ. By choosing σ0² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ.

The density of the prior is given by

$$f(\mu) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left[-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right].$$

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

$$f(\mathbf{w} \,|\, \mu, \sigma^2) = \left(2\pi\sigma^2\right)^{-N/2} \exp\left[-\frac{\sum_j (w_j - \mu)^2}{2\sigma^2}\right]. \tag{22}$$

Multiplying the equation above by the prior yields

$$f(\mu \,|\, \sigma^2, \mathbf{w}) \propto \exp\left[-\frac{\sum_j (w_j - \mu)^2}{2\sigma^2}\right] \times \exp\left[-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right].$$

We concern ourselves with only those terms that involve μ, the parameter of interest. Other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

$$f(\mu \,|\, \sigma^2, \mathbf{w}) \propto \exp\left[-\frac{(\mu - \mu')^2}{2\sigma'^2}\right],$$

where

$$\sigma'^2 = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1} \tag{23}$$

and

$$\mu' = \sigma'^2 \left(\frac{\sum_j w_j}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right). \tag{24}$$

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ′², respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ′²:

$$\mu \,|\, \sigma^2, \mathbf{w} \sim \mbox{Normal}\left(\mu', \sigma'^2\right). \tag{25}$$

Unfortunately, the notation does not make it clear that both μ′ and σ′² depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ′²(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ′² = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior is constructed to be fairly informative, with μ0 = 100 and σ0² = 100.


The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.

Estimating σ² With Known μ
In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

$$f(\sigma^2) = \frac{b^a}{\Gamma(a)} \left(\sigma^2\right)^{-(a+1)} \exp\left(-\frac{b}{\sigma^2}\right), \quad \sigma^2 \ge 0.$$

The parameters of the inverse gamma are a and b, and in application these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

$$f(\sigma^2 \,|\, \mu, \mathbf{w}) \propto \left(\sigma^2\right)^{-N/2} \exp\left[-\frac{\sum_j (w_j - \mu)^2}{2\sigma^2}\right] \times \left(\sigma^2\right)^{-(a+1)} \exp\left(-\frac{b}{\sigma^2}\right).$$

Collecting only those terms dependent on σ² yields

$$f(\sigma^2 \,|\, \mu, \mathbf{w}) \propto \left(\sigma^2\right)^{-(a'+1)} \exp\left(-\frac{b'}{\sigma^2}\right),$$

where

$$a' = \frac{N}{2} + a \tag{26}$$

and

$$b' = \frac{\sum_j (w_j - \mu)^2}{2} + b. \tag{27}$$

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

$$\sigma^2 \,|\, \mu, \mathbf{w} \sim \mbox{Inverse Gamma}(a', b'). \tag{28}$$

Note that posterior parameter b′ depends explicitly on the value of μ.

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem: The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors, μ | w and σ² | w.


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


The straightforward method of obtaining marginals is to integrate conditionals:

$$f(\mu \,|\, \mathbf{w}) = \int f(\mu \,|\, \sigma^2, \mathbf{w})\, f(\sigma^2 \,|\, \mathbf{w})\, d\sigma^2.$$

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis-Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis: With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

$$x \,|\, \eta \sim \mbox{Normal}\left(\mu + \eta, \sigma^2\right)$$

and

$$\eta \sim \mbox{Exponential}(\tau).$$

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]₁ denotes the first sample, [η]₂ denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

$$[x]_1 \sim \mbox{Normal}\left(\mu + [\eta]_1, \sigma^2\right)$$

and

$$[x]_2 \sim \mbox{Normal}\left(\mu + [\eta]_2, \sigma^2\right).$$

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.

We now return to the problem of deriving the marginal posterior distribution of μ | w from the conditional μ | σ², w. If there was a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there was a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]₁ is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

$$[\mu]_1 \sim \mbox{Normal}\left(\mu'\left([\sigma^2]_1\right), \sigma'^2\left([\sigma^2]_1\right)\right),$$

where μ′ and σ′² are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]₁ is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

$$[\mu]_m \sim \mbox{Normal}\left(\mu'\left([\sigma^2]_{m-1}\right), \sigma'^2\left([\sigma^2]_{m-1}\right)\right) \tag{29}$$

and

$$[\sigma^2]_m \sim \mbox{Inverse Gamma}\left(a', b'\left([\mu]_m\right)\right). \tag{30}$$

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]₁ and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994).


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


The initial region in which the samples reflect the starting value is termed the burn-in period; the later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, a random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m₀ denote the point at which the samples are approximately steady state. Samples after m₀ can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

$$\hat{\mu} = \frac{\sum_{m > m_0} [\mu]_m}{M - m_0}$$

and

$$\hat{\sigma}^2 = \frac{\sum_{m > m_0} [\sigma^2]_m}{M - m_0},$$

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m₀ to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution. Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (s_w̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ0 = 0, σ0² = 10⁶) and σ² (a = b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value of [σ²]₁. As can be seen, both chains converge to their steady state rather quickly; for this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.



Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]₁ = 10,000 and [σ²]₁ = 0.001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ₀ = 100 and σ₀² = 15² = 225. We may not know the particular test–retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test–retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]₁ = 1, burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows

that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different but often more efficient than its frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another. The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that, for the normal application, there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
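Continuing the sketch above, this check is a single call to R's built-in acf function:

keep <- mu[-(1:burnin)]   # post-burn-in samples from the sketch above
acf(keep)                 # autocorrelation function of the chain
plot(keep, type = "l")    # trace plot, an informal check of convergence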

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z), and, conversely, z = Φ⁻¹(p).

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ₀, σ₀²).    (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case in which a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta, with parameters a′ = a + y and b′ = b + (N − y). Noting that a = b = 1 for a flat prior, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) × exp[−(z − μ₀)² / (2σ₀²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined, upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).
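In R, this transform is one vectorized call (a sketch; z here stands for the vector of posterior samples of z produced by the sampler):

p <- pnorm(z)                       # [p]_m = Phi([z]_m), applied samplewise
hist(p, freq = FALSE)               # estimated posterior of p
curve(dbeta(x, 8, 4), add = TRUE)   # expected posterior for y = 7, N = 10, flat prior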

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ₀ = 0 and σ₀² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with density a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]₁, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of


Figure 12. Probit transform of a probability (p) into a z score (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.


1,000 samples with 100 samples of burn-in are more than sufficient for this application.

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i / N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success. The ith participant's true ability, z_i, is simply a sample from this normal.


Figure 13. Three normally distributed priors on z [N(0, 1), N(1, 1), and N(0, 2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The number of successes is modeled as

y_i | z_i ~ Binomial[N_i, Φ(z_i)],    (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ Normal(μ, σ²).    (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ₀, σ₀²)    (34)

and

σ² ~ Inverse Gamma(a, b).    (35)

Values of μ₀, σ₀², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials each. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities,² which ranged between .59 and .80, are shown in the second column of Table 2. Estimates from estimators p̂₀ (Equation 12) and p̂₂ (Equation 14) are also shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.
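Because Appendix C is not reproduced here, the following R sketch of such a sampler may help fix ideas. It is our illustration rather than the article's code: it implements Equations 32–35 with the Albert and Chib (1995) latent-variable step, and the success counts are those of Table 2.

# Hierarchical probit model (Equations 32-35) via Gibbs sampling with the
# Albert-Chib latent-variable step. A sketch, not the article's Appendix C code.
y <- c(33, 28, 34, 29, 32, 37, 30, 29, 34, 34,
       34, 37, 38, 38, 37, 33, 37, 33, 38, 42)   # successes (Table 2)
I <- length(y); Ntrials <- 50
mu0 <- 0; s20 <- 1000; a <- 1; b <- 1            # second-stage priors
M <- 5000
z <- matrix(0, M, I); mu <- numeric(M); s2 <- numeric(M)
s2[1] <- 1
for (m in 2:M) {
  # 1. Latent trial-level data w: truncated normals centered at z_i.
  #    Successes are truncated below at 0; failures are truncated above at 0.
  sumw <- numeric(I)
  for (i in 1:I) {
    zi <- z[m - 1, i]
    p0 <- pnorm(0, zi, 1)                        # P(w < 0)
    u  <- c(runif(y[i], p0, 1), runif(Ntrials - y[i], 0, p0))
    sumw[i] <- sum(qnorm(u, zi, 1))
  }
  # 2. Each z_i has a normal conditional posterior.
  v <- 1 / (Ntrials + 1 / s2[m - 1])
  z[m, ] <- rnorm(I, v * (sumw + mu[m - 1] / s2[m - 1]), sqrt(v))
  # 3. Parent parameters mu and sigma2 have conjugate conditionals.
  vmu   <- 1 / (I / s2[m - 1] + 1 / s20)
  mu[m] <- rnorm(1, vmu * (sum(z[m, ]) / s2[m - 1] + mu0 / s20), sqrt(vmu))
  s2[m] <- 1 / rgamma(1, a + I / 2, b + sum((z[m, ] - mu[m])^2) / 2)
}
p.hat <- colMeans(pnorm(z[-(1:100), ]))          # posterior means of p_i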

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ₀ = 0, σ₀² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of the other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂₀ (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂₀. The top right panel shows the hierarchical estimates as a function of p̂₀. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂₂ also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂₀. For the data of Table 2, the difference in efficiency between p̂₂ and p̂₀ is exceedingly small (about .0001). The reason that there was little gain for p̂₂ is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂₂ and p̂₀ have about the same efficiency. The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value, p̄.


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote the signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn independently from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ Normal(μ_h, σ_h²)    (36)

and

f_i ~ Normal(μ_f, σ_f²).    (37)

Second-stage priors are placed on the parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k),  k = h, f,    (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k),  k = h, f.    (39)


Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p̂₀    p̂₂    Hierarchical
 1           .587        33    .66   .654  .672
 2           .621        28    .56   .558  .615
 3           .633        34    .68   .673  .683
 4           .650        29    .58   .577  .627
 5           .659        32    .64   .635  .660
 6           .665        37    .74   .731  .716
 7           .667        30    .60   .596  .638
 8           .667        29    .58   .577  .627
 9           .687        34    .68   .673  .684
10           .693        34    .68   .673  .682
11           .696        34    .68   .673  .683
12           .701        37    .74   .731  .715
13           .713        38    .76   .750  .728
14           .734        38    .76   .750  .727
15           .736        37    .74   .731  .717
16           .738        33    .66   .654  .672
17           .743        37    .74   .731  .716
18           .759        33    .66   .654  .672
19           .786        38    .76   .750  .728
20           .803        42    .84   .827  .771
RMSE                           .053  .053  .041

Note—Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished. For all samples m,

[d′_i]_m = [h_i]_m − [f_i]_m    (40)

and

[c_i]_m = −[f_i]_m.    (41)

A conventional sensitivity estimate for each individ-ual was obtained using the formula

Φ1(Hit rate) Φ1(False alarm rate)

The resulting estimates are shown in the last column ofTable 3 labeled Conventional The hierarchical signaldetection model was analyzed with diffuse prior param-eters ah af bh bf 01 uh uf 0 and vh vf 1000 The chain was run for 10000 iterations with thefirst 500 serving as burn-in This choice of burn-in pe-riod is very conservative Autocorrelation for sensitivityfor 2 select participants is displayed in the bottom pan-els of Figure 17 this degree of autocorrelation was typ-ical for all parameters Autocorrelation was about thesame as in the previous application and the resulting es-timates are shown in the next-to-last column of Table 3

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely



Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂₀, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂₀. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme, blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items.

Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was "old" or "new." Let p_ij^(h) and p_ij^(f) be the true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are the true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) ~ Bernoulli[Φ(h_ij)]    (42)

and

y_ij^(f) ~ Bernoulli[Φ(f_ij)].    (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j    (44)

and

f_ij = μ_f + β_i + δ_j.    (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f    (46)

and

c = −μ_f.    (47)


Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146         0.880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275

Note—Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i    (48)

and

c_i = −μ_f − β_i.    (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j    (50)

and

c_j = −μ_f − δ_j.    (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ²_0k, where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ²_0k = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ Normal(0, σ_α²),    (52)

β_i ~ Normal(0, σ_β²),    (53)

γ_j ~ Normal(0, σ_γ²),    (54)

and

δ_j ~ Normal(0, σ_δ²).    (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.


effect is not assumed in the model analysis. Participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple, 50 in the example above, is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
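As an illustration, thinning in R is a single indexing operation (a sketch, following the naming of the earlier sketches, with chain denoting a long vector of post-burn-in samples):

thinned <- chain[seq(1, length(chain), by = 50)]  # keep every 50th sample
acf(thinned)                                      # autocorrelation is attenuated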

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by the conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ)′. The derivation of the conditional posterior distribution follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
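For instance, the Cholesky approach to sampling a multivariate normal takes only a few lines in R (a sketch; m and Sigma denote a hypothetical mean vector and covariance matrix):

# One draw from Normal(m, Sigma); chol returns upper triangular R
# with Sigma = t(R) %*% R, so m + t(R) %*% z has covariance Sigma
rmvnorm1 <- function(m, Sigma) {
  R <- chol(Sigma)
  m + as.vector(t(R) %*% rnorm(length(m)))
}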

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in the analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant; the y-axis value of the point is α_i, the participant's hit rate effect, and the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms. Others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The priors on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors, as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be an increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of the old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of the hit rate. Within the context of such a model, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to the analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on the variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors used here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior. Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors, and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of the data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information. This is information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices, because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: The value of an estimate for any participant depends, in part, on the values of the others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative, on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than that of Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens J H amp Dieter U (1974) Computer methods for samplingfrom gamma beta Poisson and binomial distributions Computing12 223-246

Ahrens J H amp Dieter U (1982) Generating gamma variates by amodified rejection technique Communications of the Association forComputing Machinery 25 47-54

Albert J amp Chib S (1995) Bayesian residual analysis for binary re-sponse regression models Biometrika 82 747-759

Ashby F G (1992) Multivariate probability distributions In F GAshby (Ed) Multidimensional models of perception and cognition(pp 1-34) Hillsdale NJ Erlbaum

Ashby F G Maddox W T amp Lee W W (1994) On the dangers ofaveraging across subjects when using multidimensional scaling or thesimilarity-choice model Psychological Science 5 144-151

Baayen R H Tweedie F J amp Schreuder R (2002) The subjectsas a simple random effect fallacy Subject variability and morpho-logical family effects in the mental lexicon Brain amp Language 8155-65

Bayes T (1763) An essay towards solving a problem in the doctrineof chances Philosophical Transactions of the Royal Society of Lon-don 53 370-418

Clark H H (1973) The language-as-fixed-effect fallacy A critiqueof language statistics in psychological research Journal of VerbalLearning amp Verbal Behavior 12 335-359

Cohen J (1994) The earth is round ( p 05) American Psychologist49 997-1003

Cumming G amp Finch S (2001) A primer on the understanding useand calculation of confidence intervals that are based on central andnoncentral distributions Educational amp Psychological Measurement61 532-574

Curran T C amp Hintzman D L (1995) Violations of the indepen-dence assumption in process dissociation Journal of ExperimentalPsychology Learning Memory amp Cognition 21 531-547

Egan J P (1975) Signal detection theory and ROC analysis NewYork Academic Press

Einstein G O McDaniel M A amp Lackey S (1989) Bizarre im-agery interference and distinctiveness Journal of Experimental Psy-chology Learning Memory amp Cognition 15 137-146

Estes W K (1956) The problem of inference from curves based ongrouped data Psychological Bulletin 53 134-140

Forster K I amp Dickinson R G (1976) More on the language-as-fixed-effect fallacy Monte Carlo estimates of error rates for F1 F2F prime and min F prime Journal of Verbal Learning amp Verbal Behavior 15135-142

Gelfand A E amp Smith A F M (1990) Sampling-based approachesto calculating marginal densities Journal of the American StatisticalAssociation 85 398-409

600 ROUDER AND LU

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.

Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.

Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.

Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.

Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.

Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.

Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.

Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.

Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.

Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.

Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.

Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.

Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.

Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.

Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.

Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.

Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.

Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.

Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.

Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.

McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.

Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.

Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.

Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.

Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.

Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.

Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.

Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.

Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.

Rouder, J. N., Lu, J., Speckman, P. L., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.

Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.

Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.

Wishart, J. (1928). A generalized product moment distribution in samples from a normal multivariate population. Biometrika, 20, 32-52.

Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let xj be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable wj is defined as follows:

xj = 1 if wj ≥ 0, and xj = 0 if wj < 0.  (A1)

Latent variables w are never known. All that is known is their sign: If trial j was successful, wj was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable wj is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (see Note A1). Therefore,

wj ~ Normal(z, 1), independently across trials.

This construction of wj is shown in Figure A1a. The parameters of interest are z and w = (w1, …, wJ); the data are (x1, …, xJ). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and wj | z, xj, for j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ0 and σ0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (Σj wj + μ0/σ0²)(J + 1/σ0²)⁻¹  (A2)

and

σ²′ = (J + 1/σ0²)⁻¹.  (A3)

The conditional wj | z, xj is only slightly more complex. If xj is unknown, latent variable wj is normally distributed and centered at z. When wj is conditioned on xj, the sign of wj is known. If xj = 0, then by Equation A1, wj must be negative. Hence, the conditional wj | z, xj is a normal density that is truncated above at 0 (Figure A1b). Likewise, if xj = 1, then wj must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.



NOTES

1. The beta function is given by

Be(a, b) = Γ(a)Γ(b)/Γ(a + b),

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let xij be the indicator for the ith person on the jth trial: xij = 1 if the trial is successful and 0 otherwise. Latent variable wij is constructed as

xij = 1 if wij ≥ 0, and xij = 0 if wij < 0.  (B1)

Conditional posterior wij | zi, xij is a truncated normal, as was discussed in Appendix A. Conditional posterior zi | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ0 and σ² for σ0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable wj for p = .84. (b) Conditional distribution of wj on a failure on trial j. In this case, wj must be negative. (c) Conditional distribution of wj on a success on trial j. In this case, wj must be positive.

NOTE TO APPENDIX A

A1. Let μw be the mean of w. Then, by construction,

Pr(wj < 0) = Pr(xj = 0) = 1 − p.

Random variable wj is a normal with variance of 1.0. The term Pr(wj < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(wj < 0) = Φ(0 − μw) = Φ(−μw).

Combining this equality with the one above yields

Φ(−μw) = 1 − p,

and, by symmetry of the normal,

μw = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b,1) truncated below zero
rtnpos=function(n,b){
  u=runif(n)
  b+qnorm(1-u*pnorm(b))
}
# generates n samples of a normal(b,1) truncated above zero
rtnneg=function(n,b){
  u=runif(n)
  b+qnorm(u*pnorm(-b))
}
# -----------------------------
# generate true values and data
# -----------------------------
I=20   # Number of participants
J=50   # Number of Trials per participant
# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
# ----------------------------------------
# Analysis, Set Up
M=10000          # total number of MCMC iterations
myrange=101:M    # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z=matrix(ncol=I,nrow=M)   # matrix of each person's z at each iteration
                          # note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
# prior parameters
sigma0=100   # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a
# -----------------------------
# Main MCMC loop
# -----------------------------
for (m in 2:M)    # iteration
{
  for (i in 1:I)  # participant
  {
    w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i]))  # samples of w|y,z
    prec=J+1/sigma2[m-1]
    meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
    z[m,i]=rnorm(1,meanz,sqrt(1/prec))   # sample of z|w,mu,sigma2
  }
  prec=I/sigma2[m-1]+1/sigma0
  meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
  mu[m]=rnorm(1,meanmu,sqrt(1/prec))     # sample of mu|z
  rate=sum((z[m,]-mu[m])^2)/2+b
  sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate)  # sample of sigma2|z
}
# ----------------------------------------
# Gather estimates of p for each individual
allpest=pnorm(z[myrange,])    # MCMC chains of p
pesth=apply(allpest,2,mean)   # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



et al. (2002) proposed a mixed linear model that accounts for both participant and item variation. Experimentalists, however, have a more intuitive approach: replication. The more a finding is replicated, the lower the chance that it is due to a Type I error. For example, consider two independent researchers who replicate each other's experiment at a nominal Type I error rate of .05. Assume that, due to unaccounted item variability, the actual Type I error rate for each replication is .2. If both replications are significant, the combined Type I error rate is .04, which is below the nominal criterion. Psychologists do not often use strict replication, but instead use near replications, in which there are only minor procedural differences across replicates. Consider the bizarre memory example above. Although Riefer and Rouder (1992) used aggregation to conclude that bizarre sentences are better recalled than common ones, the basic finding has been obtained repeatedly (see, e.g., Einstein, McDaniel, & Lackey, 1989; Hirshman, Whelley, & Palij, 1989; Pra Baldi, de Beni, Cornoldi, & Cavedon, 1985; Wollen & Cox, 1981), so it is surely not the result of a Type I error. Oft-replicated phenomena, such as the Stroop effect and semantic priming effects, are certainly not spurious.

The reason that replication is feasible in linear contexts (such as those underlying both t tests and ANOVA) is that population means can be estimated without bias even when there is unmodeled variability. For example, in our simulation of Type I error rates, estimates of true condition means were quite accurate and showed no apparent bias. Consequently, the true difference between two groups can be estimated without bias and may be obtained with greater accuracy by increasing sample size. In the case in which there is no true difference, increasing the sample size yields increasingly better estimates of the null group difference.

The situation is not nearly so sanguine for nonlinear models. Examples of nonlinear models include signal detection (Egan, 1975; Green & Swets, 1966), process dissociation (Jacoby, 1991), the diffusion model (Ratcliff, 1978; Ratcliff & Rouder, 1998), the fuzzy logical model of perception (Massaro & Oden, 1979), the hybrid deadline model (Rouder, 2000), the similarity choice model (Luce, 1963), the generalized context model (Medin & Schaffer, 1978; Nosofsky, 1986), and the interactive activation model (McClelland & Rumelhart, 1981). In fact, almost all models used for cognition and perception outside the ANOVA/regression framework are nonlinear. Nonlinear models are postulated because they are more realistic than linear models. Even though psychologists have readily adopted nonlinear models, they have been slow to acknowledge the effects of unmodeled variability in these contexts. These effects are not good: Unmodeled variability often yields distorted parameter estimates in many nonlinear models. For this reason, the assessment of true differences in these models is difficult.

EFFECTS OF VARIABILITY IN SIGNAL DETECTION

To demonstrate the effects of unmodeled variability on a nonlinear model, we performed a set of simulations in which the theory of signal detection (Green & Swets, 1966) was applied to a recognition memory paradigm. Consider the analysis of a recognition memory experiment that entails both randomly selected participants and items. In the signal detection model, participants monitor the familiarity of a target. If familiarity is above criterion, participants report the target as an old item; otherwise, they report it as a new item. The distribution of familiarity is assumed to be greater for old than for new items, and the degree of this difference is the sensitivity. Hits occur when old items are judged as old; false alarms occur when new items are judged as old. The model is shown in Figure 2. It is reasonable to expect that there is

Figure 1. The effect of unmodeled item variability (σ2²) on Type I error rate when data are aggregated across items. All error rates were computed for a nominal Type I error rate of .05.


participant-level variability in sensitivity: Some participants have better mnemonic abilities than others. It is reasonable also to expect item-level variability: Some items are easier to remember than others.

To explore the effects of variability, we implemented a simulation similar to the previous one. Let d′ij denote the ith individual's sensitivity to the jth item; likewise, let cij denote the ith individual's criterion when assessing the familiarity of the jth item (cij is measured as the distance from the mean of the new-item distribution; see Figure 2). Consider the following model:

d′ij = μd + αi + βj  (8)

and

cij = μc + αi/2 + βj/2.  (9)

Parameters μd and μc are grand means; parameter αi denotes the effect of the ith participant, and parameter βj denotes the effect of the jth item. The model on d′ is additive for items and participants; in this sense, it is analogous to the previous model for reading times. One odd-looking feature of this simulation model is that participant and item effects are half as large on criteria as they are on sensitivity. This feature is motivated by Glanzer, Adams, Iverson, and Kim's (1993) mirror effect. The mirror effect refers to an often-observed pattern in which manipulations that increase sensitivity do so by both increasing hit rates and decreasing false alarms. Consider the case of unbiased responding shown in Figure 2. In the figure, the criterion is half of the value of the sensitivity. If a manipulation were to increase sensitivity by increasing the hit rate and decreasing the false alarm rate in equal increments, then criterion must increase half as much as the sensitivity gain (like sensitivity, criterion is measured from 0, the mean of the new-item distribution). In Equations 8 and 9, the particular effect of a participant or item is balanced in hit and false alarm rates: If a particular participant or item is associated with high sensitivity, the effect is equally strong in hit and false alarm rates.

Because αi and βj denote the effects from randomly selected participants and items, it is reasonable to model them as random effects. These are given by

αi ~ Normal(0, σ1²)  (10)

and

βj ~ Normal(0, σ2²),  (11)

with all effects mutually independent.

We simulated data in a manner similar to the previous simulation. First, item and participant effects were sampled according to Equations 10 and 11. Then these random effects were combined according to Equations 8 and 9 in order to produce underlying sensitivities and criteria for each participant-item combination. On the basis of these true values, hypothetical responses were produced in accordance with the signal detection model shown in Figure 2. These hypothetical responses are dichotomous: A specific participant judges a specific item as either old or new. These responses were aggregated across items to produce hit and false alarm rates for each individual. From these aggregated rates, d′ and c were computed for each participant. Because individualized estimates of d′ were computed, variability across participants was not problematic (σ1 was set equal to .5). The left panel of Figure 3 shows the results of a simulation with no item effects (σ2 = 0). As expected, d′ estimates appear unbiased. The right panel shows a case with item variability (σ2 = 1.5). The estimates systematically underestimate true values. This underestimation is an asymptotic bias; it does not diminish as the numbers of participants and items increase. We implemented variability in the context of a mirror effect, but the underestimation is also obtained in other implementations. Simulations with item variability exclusively in hits result in the same underestimation. Likewise, simulations with item variability in c alone result in the same underestimation (see Wickelgren, 1968, who makes a comparable argument).
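The following R sketch illustrates this simulation under Equations 8-11. The numbers of participants and items (I and J) and the grand means (mud and muc) are illustrative assumptions rather than the authors' exact settings; the values of σ1 and σ2 follow the text.

set.seed(1)
I=50; J=200              # assumed numbers of participants and items
mud=1.5; muc=mud/2       # assumed grand means; unbiased responding
sigma1=.5; sigma2=1.5    # participant and item variability from the text
alpha=rnorm(I,0,sigma1)  # participant effects (Equation 10)
beta=rnorm(J,0,sigma2)   # item effects (Equation 11)
dp=outer(alpha,beta,"+")+mud      # sensitivities d'_ij (Equation 8)
cr=outer(alpha,beta,"+")/2+muc    # criteria c_ij (Equation 9)
hit=matrix(rbinom(I*J,1,pnorm(dp-cr)),I,J)  # old-item responses
fa=matrix(rbinom(I*J,1,pnorm(-cr)),I,J)     # new-item responses
dhat=qnorm(rowMeans(hit))-qnorm(rowMeans(fa))  # aggregated estimates
mean(dhat)-mud   # negative value: systematic underestimation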

The presence of asymptotic bias is disconcerting. Unlike analysis with ANOVA, replication does not guarantee correct inference. In the case of signal detection, one cannot tell whether a difference in overall estimated sensitivity between two conditions is due to a true sensitivity difference or to an increase in unmodeled variability in one condition relative to the other. One domain in which asymptotic underestimation is particularly pernicious is signal detection analyses of subliminal semantic activation.


Figure 2. The signal detection model. Hit and false alarm probabilities are the areas of the old-item and new-item familiarity distributions that are greater than the criterion, respectively. Values of sensitivity (d′) and criterion (c) are measured from 0, the mean of the new-item familiarity distribution.


Greenwald, Draine, and Abrams (1996), for example, used signal detection sensitivity aggregated across items to claim that a set of word primes was subliminal (d′ = 0). These subliminal primes then affect response times to subsequent stimuli. The asymptotic downward bias from item variability, however, renders suspect the claim that the primes were truly subliminal.

The presence of asymptotic bias with unmodeled variability is not unique to signal detection. In general, unmodeled variability leads to asymptotic bias in nonlinear models. Psychologists have long known about problems associated with aggregation in specific contexts. For example, Estes (1956) warned about aggregating learning curves across participants: Aggregate learning curves may be more graded than those of individuals, leading researchers to misdiagnose underlying mechanisms (Haider & Frensch, 2002; Heathcote, Brown, & Mewhort, 2000). Curran and Hintzman (1995) critiqued aggregation in Jacoby's (1991) process dissociation procedure. They showed that aggregating responses over participants, items, or both possibly leads to asymptotic underestimation of automaticity estimates. Ashby, Maddox, and Lee (1994) showed that aggregating data across participants possibly distorts estimates from similarity choice-based scaling (such as those from Gilmore, Hersh, Caramazza, & Griffin, 1979). Each of the examples above is based on the same general problem: Unmodeled variability distorts estimates in a nonlinear context. In all of these cases, the distortion does not diminish with increasing sample sizes. Unfortunately, the field has been slow to grasp the general nature of this problem. The critiques above were made in isolation and appeared as specific critiques of specific models. Instead, we see them as different instances of the negative effects of aggregating data in analyses with nonlinear models.

The solution is to model both participant and item variability simultaneously. We advocate Bayesian hierarchical models for this purpose (Rouder, Lu, Speckman, Sun, & Jiang, 2005; Rouder, Sun, Speckman, Lu, & Zhou, 2003). In a hierarchical model, variability on several levels, such as from the selection of items and participants as well as from measurement error, is modeled simultaneously. As was mentioned previously, hierarchical models have been used in linear contexts to improve power and to better control Type I error rates (see, e.g., Baayen et al., 2002; Kreft & de Leeuw, 1998). Although better statistical control is attractive to experimentalists, it is not critical. Instead, it is often easier (and wiser) to replicate results than to learn and implement complex statistical methodologies. For nonlinear models, however, hierarchical models are critical because bias is asymptotic.

The main drawback to nonlinear hierarchical models is tractability: They are difficult to implement. Starting in the 1980s and 1990s, however, new computational techniques based on Bayesian analysis emerged in statistics. The utility of these new techniques is immense; they allow for the analysis of previously intractable models, including nonlinear hierarchical ones (Gelfand & Smith, 1990; Gelman, Carlin, Stern, & Rubin, 2004; Gill, 2002; Tanner, 1996). These new Bayesian-based techniques have had tremendous impact in many quantitatively oriented disciplines, including statistics, engineering, bioinformatics, and economics.

The goal of this article is to provide an introduction to Bayesian hierarchical modeling. The focus is on a relatively new computational technique that has made Bayesian analysis more tractable: Markov chain Monte Carlo sampling. In the first section, basic issues of estimation are discussed. Then this discussion is expanded to include the basics of Bayesian estimation. Following that, Markov chain Monte Carlo integration is introduced. Finally, a series of hierarchical signal detection models is presented. These models provide unbiased and relatively efficient sensitivity estimates for the standard equal-variance signal detection model, even in the presence of both participant and item variability.

Figure 3. The effect of unmodeled item variability (σ2²) on the estimation of sensitivity (d′). True values of d′ are the means of d′ij across items. The left and right panels, respectively, show the results without and with item variability (σ2 = 1.5). For both plots, the value of σ1 was .5.



ESTIMATION

It is easier to discuss Bayesian estimation in the context of a simple example. Consider the case in which a single participant performs N trials. The result of each trial is either a success or a failure. We wish to estimate the underlying true probability of a success, denoted p, for this single participant. Estimating a probability of success is essential to many applications in cognitive psychology, including signal detection analysis. In signal detection, we typically estimate hit and false alarm rates. Hits are successes on old-item trials, and false alarms are failures on new-item trials.

A formula that provides an estimate is termed an estimator. A reasonable estimator for this case is

p̂0 = y/N,  (12)

where y is the number of successes out of N trials. Estimator p̂0 is natural and straightforward.

To evaluate the usefulness of estimators, statisticians usually discuss three basic properties: bias, efficiency, and consistency. Bias and efficiency are illustrated in Table 1. The data are the results of weighing a hypothetical person of 170 lb on two hypothetical scales four separate times. Bias refers to the mean of repeated estimates. Scale A is unbiased because the mean of the estimates equals the true value of 170 lb. Scale B is biased: The mean is 172 lb, which is 2 lb greater than the true value of 170 lb. Scale B, however, has a smaller degree of error than does Scale A, so Scale B is termed more efficient than Scale A. Efficiency is the inverse of the expected error of an observation and may be indexed by the reciprocal of root mean squared error. Bias and efficiency have the same meaning for estimators as they do for scales. Bias refers to the difference between the average value of an estimator and a true value. Efficiency refers to the average magnitude of the difference between an estimator and a true value. In many situations, efficiency determines the quality of inference more than bias does.

How biased and efficient is estimator p̂0? To provide a context for evaluation, consider the following two alternative estimators:

p̂1 = (y + .5)/(N + 1)  (13)

and

p̂2 = (y + 1)/(N + 2).  (14)

These two alternatives may seem unprincipled, but as is discussed in the next section, they are justified in a Bayesian framework.

Figure 4 shows sampling distributions for the three probability estimators when the true probability is p = .7 and there are N = 10 trials. Estimator p̂0 is not biased, but estimators p̂1 and p̂2 are. Surprisingly, estimator p̂2 has the lowest average error; that is, it is the most efficient. The reason is that the tails of the sampling distribution are closer to the true value of p = .7 for p̂2 than for p̂1 or p̂0. Figure 4 shows the case for a single true value of p = .7. Figure 5 shows bias and efficiency for all three estimators for the full range of p. The conventional estimator p̂0 is unbiased for all true values of p, but the other two estimators are biased for extreme probabilities. None of the estimators is always more efficient than the others: For intermediate probabilities, estimator p̂2 is most efficient; for extreme probabilities, estimator p̂0 is most efficient. Typically, researchers have some idea of what type of probability of success to expect in their experiments. This knowledge can therefore be used to help pick the best estimator for a particular situation.
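Because the sampling distribution of y is binomial, the bias and RMSE shown in Figures 4 and 5 can be computed exactly. The following R sketch (an illustration, not code from the article) does so for p = .7 and N = 10:

p=.7; N=10; y=0:N
prob=dbinom(y,N,p)   # probability of each possible outcome y
ests=list(p0=y/N, p1=(y+.5)/(N+1), p2=(y+1)/(N+2))
for (nm in names(ests)) {
  e=ests[[nm]]
  bias=sum(prob*e)-p               # mean of estimator minus true value
  rmse=sqrt(sum(prob*(e-p)^2))     # root mean squared error
  cat(nm,"bias =",round(bias,3),"RMSE =",round(rmse,3),"\n")
}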

The other property of estimators is consistency. A consistent estimator converges to its true value: As the sample size is increased, the estimator not only becomes unbiased, but the overall variance shrinks toward 0. Consistency should be viewed as a necessary property of a good estimator. Fortunately, many estimators used in psychology are consistent, including sample means, sample variances, and sample correlation coefficients. The three estimators p̂0, p̂1, and p̂2 are consistent; with sufficient data, they converge to the true value of p. In fact, all estimators presented in this article are consistent.

BAYESIAN ESTIMATION OF A PROBABILITY

In this section, we provide a Bayesian approach to estimating the probability of success. The datum in this application is the number of successes (y), and this quantity may be modeled as a binomial:

y ~ Binomial(N, p),

where N is known and p is the unknown parameter of interest.


Table 1
Weight of a 170-Pound Person Measured on Two Hypothetical Scales

           Scale A               Scale B
Data (lb)  180, 160, 175, 165    174, 170, 173, 171
Mean       170                   172
Bias       0                     2.0
RMSE       7.91                  2.55


Bayes's (1763) insightful contribution was to use the law of conditional probability to estimate the parameter p. He noted that

f(p | y) = f(y | p) f(p) / f(y).  (15)

We describe each of these terms in order.

The main goal is to find f(p | y), the quantity on the left-hand side of Equation 15. The term f(p | y) is the distribution of the parameter p given the data y. This distribution is referred to as the posterior distribution. It is viewed as a function of the parameter p. The mean of the posterior distribution, termed the posterior mean, serves as a suitable point estimator for p.

The term f(y | p) plays two roles in statistics, depending on context. When it is viewed as a function of y for known p, it is known as the probability mass function. The probability mass function describes the probability of any outcome y given a particular value of p. For a binomial random variable, the probability mass function is given by

f(y | p) = (N choose y) p^y (1 − p)^(N−y) for y = 0, 1, …, N, and f(y | p) = 0 otherwise.  (16)

For example, this function can be used to compute the probability of observing a total of two successes on three trials when p = .5:

f(y = 2 | p = .5) = (3 choose 2)(.5)²(1 − .5)¹ = 3/8.
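In R, this calculation is a single call to the binomial probability mass function (an illustrative check, not code from the article):

dbinom(2,3,.5)   # returns 0.375, that is, 3/8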

The term f(y | p) can also be viewed as a function of parameter p for fixed y. In this case, it is termed the likelihood function and describes the likelihood of a particular parameter value p given a fixed value of y. For the binomial, the likelihood function is given by

f(y | p) = (N choose y) p^y (1 − p)^(N−y), 0 ≤ p ≤ 1.  (17)

In Bayes's theorem, we are interested in the posterior as a function of the parameter p. Hence, the right-hand terms of Equation 15 are viewed as a function of parameter p. Therefore, f(y | p) is viewed as the likelihood of p rather than the probability mass of y.

The term f(p) is the prior distribution and reflects the experimenter's a priori beliefs about the true value of p.

Finally, the term f(y) is the distribution of the data given the model. Although its interpretation is important in some contexts, it plays a minimal role in the development presented here, for the following reason:


Figure 4. Sampling distributions of p̂0, p̂1, and p̂2 for N = 10 trials with a p = .7 true probability of success on any one trial. Bias and root mean squared error (RMSE) are included.


The posterior is a function of the parameter of interest, p, but f(y) does not depend on p. The term f(y) is a constant with respect to p and serves as a normalizing constant that ensures that the posterior density integrates to 1. As will be shown in the examples, the value of this normalizing constant usually becomes apparent in analysis.

To perform Bayesian analysis, the researcher must choose a prior distribution. For this application of estimating a probability from binomially distributed data, the beta distribution serves as a suitable prior distribution for p. The beta is a flexible two-parameter distribution; see Figure 6 for examples. In contrast to the normal distribution, which has nonzero density on all numbers, the beta has nonzero density between 0 and 1. The beta distribution is a function of two parameters, a and b, that determine its shape. The notation for specifying a beta distribution prior for p is

p ~ Beta(a, b).

The density of a beta random variable is given by

f(p) = p^(a−1)(1 − p)^(b−1) / Be(a, b).  (18)

The denominator, Be(a, b), is termed the beta function.¹ Like f(y), it is not a function of p and plays the role of a normalizing constant in analysis.

In practice, researchers must completely specify the prior distribution before analysis; that is, they must choose suitable values for a and b. This choice reflects researchers' beliefs about the possible values of p before data are collected. When a = 1 and b = 1 (see the middle panel of Figure 6), the beta distribution is flat; that is, there is equal density for all values in the interval [0, 1]. By choosing a = b = 1, the researcher is committing to all values of p being equally likely before data are collected. Researchers can choose values of a and b to best match their beliefs about the expected data. For example, consider a psychophysical experiment in which the value of p will almost surely be greater than .5. For this experiment, priors with a > b are appropriate.

The goal is to derive the posterior distribution using Bayes's theorem (Equation 15). Substituting the likelihood function (Equation 17) and prior (Equation 18) into Bayes's theorem yields

f(p | y) = (N choose y) p^y (1 − p)^(N−y) × p^(a−1)(1 − p)^(b−1) / [Be(a, b) f(y)].

Collecting terms in p yields

f(p | y) = k p^(y+a−1)(1 − p)^(N−y+b−1),  (19)

where

k = [Be(a, b) f(y)]⁻¹ (N choose y).

The term k is constant with respect to p and serves as a normalizing constant. Substituting

a′ = y + a and b′ = (N − y) + b  (20)

into Equation 19 yields

f(p | y) = k p^(a′−1)(1 − p)^(b′−1).

This equation is proportional to a beta density for parameters a′ and b′. To make f(p | y) a proper probability density, k has a value such that f(p | y) integrates to 1.


Figure 5. Bias and root mean squared error (RMSE) for the three estimators as functions of true probability of success. The solid, dashed, and dashed-dotted lines denote the characteristics of p̂0, p̂1, and p̂2, respectively.

Figure 6. Probability density function of the beta distribution for various values of parameters a and b.


This occurs only if the value of k is a beta function; for example, k = 1/Be(a′, b′). Consequently,

f(p | y) = p^(a′−1)(1 − p)^(b′−1) / Be(a′, b′).  (21)

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by separating those terms that depend on the parameter of interest from those that do not. The latter terms serve as a normalizing constant and can be computed by ensuring that the posterior integrates to 1.0. In most derivations, the value of the normalizing constant becomes apparent, much as it did above. Consequently, it is convenient to consider only those terms that depend on the parameter of interest and to lump the rest into a proportionality constant. Hence, a convenient form of Bayes's theorem is

Posterior distribution ∝ Likelihood function × Prior distribution.

The symbol "∝" denotes proportionality and is read is proportional to. This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1, and the observed data are 7 successes in 10 trials. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably more narrow than the prior, indicating that a large degree of information has been gained from the data. There are two valuable quantities derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.
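The posterior for this example is easily computed from Equation 20. The following R sketch is an illustration (not code from the article); for simplicity, it reports the central 95% credible interval from quantiles rather than the highest density region:

y=7; N=10; a=1; b=1         # data and flat Beta(1,1) prior
ap=y+a; bp=(N-y)+b          # posterior parameters a' and b' (Equation 20)
ap/(ap+bp)                  # posterior mean, 8/12 = .667
qbeta(c(.025,.975),ap,bp)   # central 95% credible interval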

For the case of a beta-distributed posterior, the expression for the posterior mean is

p̂ = a′/(a′ + b′) = (y + a)/(N + a + b).

The posterior mean reduces to p̂1 if a = b = .5; it reduces to p̂2 if a = b = 1. Therefore, estimators p̂1 and p̂2 are theoretically justified from a Bayesian perspective. The estimator p̂0 may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper; conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper: In some cases, an improper prior will still yield a proper posterior. When this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.

When the prior and posterior are from the same distribution, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data because the resulting posterior is also a beta distribution. Conjugate priors are convenient; they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion about conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model: We will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations w = (w1, …, wN) are independent and identically distributed as normals:

wj ~ Normal(μ, σ²), j = 1, …, N.

The goal is to estimate parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²
Assume that σ² is known and that the estimation of μ is of primary interest. The goal, then, is to derive the posterior f(μ | σ², w).

Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


According to the proportional form of Bayes's theorem,

f(μ | σ², w) ∝ f(w | μ, σ²) f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

μ ~ Normal(μ0, σ0²).

Parameters μ0 and σ0² must be specified before analysis, according to the researcher's beliefs about μ. By choosing σ0² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ.

The density of the prior is given by

f(μ) = (2πσ0²)^(−1/2) exp[−(μ − μ0)²/(2σ0²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

f(w | μ, σ²) = (2πσ²)^(−N/2) exp[−Σj (wj − μ)²/(2σ²)].  (22)

Multiplying the equation above by the prior yields

f(μ | σ², w) ∝ exp[−Σj (wj − μ)²/(2σ²)] × exp[−(μ − μ0)²/(2σ0²)].

We concern ourselves with only those terms that involve μ, the parameter of interest. Other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

f(μ | σ², w) ∝ exp[−(μ − μ′)²/(2σ²′)],

where

σ²′ = (N/σ² + 1/σ0²)⁻¹  (23)

and

μ′ = σ²′ (Σj wj/σ² + μ0/σ0²).  (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ²′, respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ²′:

μ | σ², w ~ Normal(μ′, σ²′).  (25)

Unfortunately, the notation does not make it clear that both μ′ and σ²′ depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ²′(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ²′ = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior



is constructed to be fairly informative, with μ0 = 100 and σ0² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean, but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.
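A quick R check of this example follows (an illustration, not code from the article); it simply evaluates Equations 23 and 24 for the data and prior above:

w=c(112,106,104,111)
sigma2=16; mu0=100; sigma02=100
postvar=1/(length(w)/sigma2+1/sigma02)        # Equation 23, about 3.85
postmean=postvar*(sum(w)/sigma2+mu0/sigma02)  # Equation 24, about 107.9
c(postmean,postvar)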

Estimating σ² With Known μ
In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

f(σ²) = [b^a / Γ(a)] (σ²)^−(a+1) exp(−b/σ²), σ² ≥ 0.

The parameters of the inverse gamma are a and b, and in application these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

f(σ² | μ, w) ∝ (σ²)^(−N/2) exp[−Σj (wj − μ)²/(2σ²)] × (σ²)^−(a+1) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

f(σ² | μ, w) ∝ (σ²)^−(a′+1) exp(−b′/σ²),

where

a′ = N/2 + a  (26)

and

b′ = Σj (wj − μ)²/2 + b.  (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

σ² | μ, w ~ Inverse Gamma(a′, b′).  (28)

Note that posterior parameter b′ depends explicitly on the value of μ.

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem: The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors, μ | w and σ² | w.


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


The straightforward method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis-Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis: With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]1 denotes the first sample, [η]2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]1 ~ Normal(μ + [η]1, σ²)

and

[x]2 ~ Normal(μ + [η]2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
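The following R sketch performs this Monte Carlo integration; the parameter values are arbitrary illustrations, not values from the article:

M=100000; mu=400; sigma=50; tau=100   # assumed ex-Gaussian parameters
eta=rexp(M,rate=1/tau)                # samples of the exponential component
x=rnorm(M,mean=mu+eta,sd=sigma)       # conditional normals; marginally ex-Gaussian
hist(x,breaks=100,freq=FALSE)         # histogram approximates the ex-Gaussian density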

We now return to the problem of deriving the marginal posterior distribution of μ | w from the conditional μ | σ², w. If there were a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there were a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]1 ~ Normal(μ′([σ²]1), σ²′([σ²]1)),

where μ′ and σ²′ are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]m ~ Normal(μ′([σ²]m), σ²′([σ²]m))  (29)

and

[σ²]m ~ Inverse Gamma(a′, b′([μ]m−1)).  (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994).


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


The initial region in which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, a random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m0 denote the point at which the samples are approximately steady state. Samples after m0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σm>m0 [μ]m / (M − m0)

and

σ̂² = Σm>m0 [σ²]m / (M − m0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m0 to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution; Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.
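For example, in R (an illustrative sketch, with arbitrary parameter values), inverse-gamma samples may be obtained as follows:

a=2; b=3                           # illustrative shape and scale values
s2=1/rgamma(10000,shape=a,rate=b)  # inverse-gamma(a, b) samples
mean(s2)                           # near the theoretical mean b/(a-1) = 3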

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant, w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (sw̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ0 = 0, σ0² = 10⁶) and σ² (a = b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value for [σ²]1. As can be seen, both chains converge to their steady state rather quickly. For this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.

Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]1 = 10,000 and [σ²]1 = .001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ0 = 100 and σ0² = 15² = 225. We may not know the particular test-retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test-retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]1 = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.
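The informative-prior analysis can be reproduced, up to Monte Carlo error, with a few lines of R. The sketch below is an illustration (not the authors' code) implementing Equations 23, 24, and 26-30:

w=c(112,106,104,111); N=length(w)
mu0=100; sigma02=225       # informative prior on mu
a=3; b=100                 # informative inverse-gamma prior on sigma^2
M=10000; mu=1:M; sigma2=1:M
sigma2[1]=1                # arbitrary starting value
shape=N/2+a                # Equation 26
for (m in 2:M) {
  postvar=1/(N/sigma2[m-1]+1/sigma02)               # Equation 23
  postmean=postvar*(sum(w)/sigma2[m-1]+mu0/sigma02) # Equation 24
  mu[m]=rnorm(1,postmean,sqrt(postvar))             # Equation 29
  rate=sum((w-mu[m])^2)/2+b                         # Equation 27
  sigma2[m]=1/rgamma(1,shape=shape,rate=rate)       # Equation 30
}
keep=101:M                        # discard burn-in of 100 iterations
mean(mu[keep])                    # posterior mean, about 108
quantile(mu[keep],c(.025,.975))   # 95% credible interval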

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another: The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that for the normal application there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid: Good approximations may be obtained with short burn-ins and relatively short chains.

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the inverse of the normal cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z) and, conversely, z = Φ⁻¹(p).
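In R, for instance, pnorm() and qnorm() implement Φ and Φ⁻¹, so the mapping shown in Figure 12 can be verified directly:

qnorm(.69)   # z = Φ⁻¹(.69), approximately 0.5, as in Figure 12
pnorm(0.5)   # returns approximately .69, recovering p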

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ0, σ0²).   (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case in which a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) × exp[−(z − μ0)² / (2σ0²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined, upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ0 = 0 and σ0² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with density a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]_1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period.


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.


Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.
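Below is a minimal R sketch of this analysis as we read it from Appendix A; it is our own illustration (the function name, variable names, and starting value are ours, not the article's code). It reproduces the y = 7, N = 10 example.

# Gibbs sampler for z = probit(p) with prior z ~ Normal(mu0, sig0sq),
# using the Albert-Chib latent-variable construction of Appendix A
probit.gibbs <- function(y, N, M = 10000, mu0 = 0, sig0sq = 1) {
  x <- c(rep(1, y), rep(0, N - y))   # dichotomous trial outcomes
  z <- numeric(M)                    # chain for z; [z]_1 = 0
  for (m in 2:M) {
    # w_j | z, x_j: a normal truncated to (0, Inf) for successes and to
    # (-Inf, 0] for failures, sampled by inverting the normal CDF
    cut <- pnorm(0, mean = z[m - 1], sd = 1)
    u <- ifelse(x == 1, runif(N, cut, 1), runif(N, 0, cut))
    w <- qnorm(u, mean = z[m - 1], sd = 1)
    # z | w: conjugate normal update, as in the appendix
    v <- 1 / (N + 1 / sig0sq)
    z[m] <- rnorm(1, v * (sum(w) + mu0 / sig0sq), sqrt(v))
  }
  z
}

z.chain <- probit.gibbs(y = 7, N = 10)
p.chain <- pnorm(z.chain[-(1:100)])   # [p]_m = Phi([z]_m), after burn-in
hist(p.chain, freq = FALSE)           # approximates the expected Beta(8, 4) density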

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i/N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that, although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure.

One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success.


Figure 13. Three normally distributed priors on z [from left to right, N(0, 1), N(1, 1), and N(0, 2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The ith participant's true ability, z_i, is simply a sample from this normal. The number of successes is modeled as

y_i | z_i ~ind Binomial[N_i, Φ(z_i)],   (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ind Normal(μ, σ²).   (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ0, σ0²)   (34)

and

σ² ~ Inverse Gamma(a, b).   (35)

Values of μ0, σ0², a, and b must be specified before analysis. We performed an analysis on hypothetical data in order

to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities, which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂0 (Equation 12) and p̂2 (Equation 14) are shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ0 = 0, σ0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂0. The top right panel shows the hierarchical estimates as a function of p̂0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂0. For the data of Table 2, the difference in efficiency between p̂2 and p̂0 is exceedingly small (about .0001). The reason that there was little gain for p̂2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂2 and p̂0 have about the same efficiency.


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ind Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ind Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ind Normal(μ_h, σ_h²)   (36)

and

f_i ~ind Normal(μ_f, σ_f²).   (37)

Second-stage priors are placed on parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k), k = h, f,   (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k), k = h, f.   (39)

Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p̂0    p̂2     Hierarchical
 1           .587        33    .66   .654   .672
 2           .621        28    .56   .558   .615
 3           .633        34    .68   .673   .683
 4           .650        29    .58   .577   .627
 5           .659        32    .64   .635   .660
 6           .665        37    .74   .731   .716
 7           .667        30    .60   .596   .638
 8           .667        29    .58   .577   .627
 9           .687        34    .68   .673   .684
10           .693        34    .68   .673   .682
11           .696        34    .68   .673   .683
12           .701        37    .74   .731   .715
13           .713        38    .76   .750   .728
14           .734        38    .76   .750   .727
15           .736        37    .74   .731   .717
16           .738        33    .66   .654   .672
17           .743        37    .74   .731   .716
18           .759        33    .66   .654   .672
19           .786        38    .76   .750   .728
20           .803        42    .84   .827   .771

RMSE                           .053  .053   .041

Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished. For all samples,

[d′_i]_m = [h_i]_m − [f_i]_m   (40)

and

[c_i]_m = −[f_i]_m.   (41)

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(Hit rate) − Φ⁻¹(False alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = .01, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.
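As a check on the arithmetic, the conventional formula is a one-liner in R; applied to participant 1 of Table 3 (35 hits and 7 false alarms in 50 signal and 50 noise trials), it reproduces the tabled value:

qnorm(35/50) - qnorm(7/50)   # = 1.605, the Conventional entry for participant 1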

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.



SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. We show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then, we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension

It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was "old" or "new." Let p_ij^(h) and p_ij^(f) be true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) ~ind Bernoulli[Φ(h_ij)]   (42)

and

y_ij^(f) ~ind Bernoulli[Φ(f_ij)].   (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j   (44)

and

f_ij = μ_f + β_i + δ_j.   (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f   (46)

and

c = −μ_f.   (47)


Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146          .880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580

RMSE                                      .183          .275

Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i   (48)

and

c_i = −(μ_f + β_i).   (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j   (50)

and

c_j = −(μ_f + δ_j).   (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²),   (52)

β_i ~ind Normal(0, σ_β²),   (53)

γ_j ~ind Normal(0, σ_γ²),   (54)

and

δ_j ~ind Normal(0, σ_δ²).   (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in model analysis; participant and item effects on hits and false alarms are estimated independently.


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.



Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
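A sketch of the bookkeeping in R (the chain here is an empty placeholder; M, the burn-in value, and the multiple of 50 follow the text's example):

M <- 1e6                                    # a deliberately long chain
burnin <- 1000                              # hypothetical burn-in period
chain <- numeric(M)                         # placeholder for stored samples
thinned <- chain[seq(burnin + 50, M, 50)]   # keep every 50th post-burn-in sample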

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling

Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation is also true of μ_f and, subsequently, of the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution.

Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior. Hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
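A minimal sketch of the Cholesky device in R (the function name is ours):

# One draw from a multivariate normal with mean mu and covariance Sigma,
# built from univariate standard normals (Ashby, 1992)
rmvn.chol <- function(mu, Sigma) {
  R <- chol(Sigma)                  # upper triangular, t(R) %*% R = Sigma
  as.vector(mu + t(R) %*% rnorm(length(mu)))
}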

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18). Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.



The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant. The y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation.


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms. Others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.



FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors

The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ind Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ind Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
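For illustration only, a single 2 × 2 covariance matrix could be drawn in R as follows; the degrees of freedom and scale matrix are our own arbitrary assumptions, not values from the text:

S0 <- diag(2)                                 # illustrative scale matrix
W <- rWishart(1, df = 3, Sigma = S0)[, , 1]   # one Wishart draw (base stats)
Sigma.i <- solve(W)                           # inverted draw, a candidate covariance matrix
cov2cor(Sigma.i)                              # implied correlation between effects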

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model

In the present model, the variances of the old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions

Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times: The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection

We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent; the value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003).


When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls

There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests

Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools, such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.

Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.

Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.

Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.

Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.

Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.

Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.

Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.

Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.

Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.

Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.

Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.

Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.

Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.

Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.

Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.

Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.

Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.

Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.

Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.

Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.

Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.

Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.

Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.

Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.

Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.

Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.

Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.

Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.

Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.

Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.

McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.

Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.

Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.

Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.

Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.

Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.

Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.

Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.

Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.

Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0, and x_j = 0 if w_j < 0.  (A1)

Latent variables w are never known. All that is known is their sign: If a trial j was successful, w_j was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p).^A1 Therefore,

w_j ~ Normal(z, 1), independently across trials.

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, …, w_J); the data are (x_1, …, x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ₀ and σ₀², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ₀²)⁻¹ (Σ_j w_j + μ₀/σ₀²)  (A2)

and

σ²′ = (N + 1/σ₀²)⁻¹.  (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.
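To make the steps concrete, the following is a minimal R sketch of this sampler for a single probability, using the same inverse-CDF truncated-normal device as Appendix C; the data values, prior settings, and variable names are illustrative and not from the article.

# Minimal sketch of the Albert-Chib Gibbs sampler for one probability
y <- 7; N <- 10          # illustrative successes and trials
mu0 <- 0; sigma0 <- 1    # illustrative prior mean and variance of z
M <- 5000                # iterations
z <- numeric(M)          # chain of samples of z; z[1] = 0 is arbitrary
for (m in 2:M) {
  # w | z, x: truncated normals (positive for successes, negative for failures)
  u1 <- runif(y)
  w.pos <- z[m-1] + qnorm(1 - u1 * pnorm(z[m-1]))
  u2 <- runif(N - y)
  w.neg <- z[m-1] + qnorm(u2 * pnorm(-z[m-1]))
  w <- c(w.pos, w.neg)
  # z | w: normal with mean and variance from Equations A2 and A3
  prec <- N + 1/sigma0
  z[m] <- rnorm(1, (sum(w) + mu0/sigma0)/prec, sqrt(1/prec))
}
p.post <- pnorm(z[-(1:100)])  # discard burn-in; transform back to p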


Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.

Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.

Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.

NOTES

1. The beta function is given by

Be(a, b) = Γ(a)Γ(b)/Γ(a + b),

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.




APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0, and x_ij = 0 if w_ij < 0.  (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ₀ and σ² for σ₀²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.



Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j. In this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j. In this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.

HIERARCHICAL BAYESIAN MODELS 603

APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# -------------------------------------------
# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b, 1) truncated below zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a normal(b, 1) truncated above zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# generate true values and data
# -----------------------------
I = 20  # number of participants
J = 50  # number of trials per participant

# generate each individual's true p
p = rbeta(I, 3.5, 1.5)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251

# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# Analysis Set Up
# ---------------
M = 10000          # total number of MCMC iterations
myrange = 101:M    # portion kept; burn-in of 100 iterations discarded

# needed vectors and matrices
z = matrix(ncol = I, nrow = M)  # each person's z at each iteration
                                # note: z is probit of p, e.g., z = qnorm(p)
sigma2 = 1:M
mu = 1:M

# prior parameters
sigma0 = 100  # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1

# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)

# shape for inverse gamma, calculated outside of loop for speed
shape = I/2 + a

# --------------
# Main MCMC loop
# --------------
for (m in 2:M) {      # iteration
  for (i in 1:I) {    # participant
    # samples of w | y, z
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))
    prec = J + 1/sigma2[m-1]
    meanz = (sum(w) + mu[m-1]/sigma2[m-1])/prec
    z[m, i] = rnorm(1, meanz, sqrt(1/prec))  # sample of z | w, mu, sigma2
  }
  prec = I/sigma2[m-1] + 1/sigma0
  meanmu = (sum(z[m, ])/sigma2[m-1] + mu0/sigma0)/prec
  mu[m] = rnorm(1, meanmu, sqrt(1/prec))  # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2)/2 + b
  sigma2[m] = 1/rgamma(1, shape = shape, scale = 1/rate)  # sample of sigma2 | z
}

# -----------------------------------------
# Gather estimates of p for each individual
# -----------------------------------------
allpest = pnorm(z[myrange, ])    # MCMC chains of p
pesth = apply(allpest, 2, mean)  # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



participant-level variability in sensitivity: Some participants have better mnemonic abilities than others. It is reasonable also to expect item-level variability: Some items are easier to remember than others.

To explore the effects of variability, we implemented a simulation similar to the previous one. Let d′_ij denote the ith individual's sensitivity to the jth item; likewise, let c_ij denote the ith individual's criterion when assessing the familiarity of the jth item (c_ij is measured as the distance from the mean of the new-item distribution; see Figure 2). Consider the following model:

d′_ij = μ_d + α_i + β_j  (8)

and

c_ij = μ_c + α_i/2 + β_j/2.  (9)

Parameters μ_d and μ_c are grand means; parameter α_i denotes the effect of the ith participant, and parameter β_j denotes the effect of the jth item. The model on d′ is additive for items and participants; in this sense, it is analogous to the previous model for reading times. One odd-looking feature of this simulation model is that participant and item effects are half as large on criteria as they are on sensitivity. This feature is motivated by Glanzer, Adams, Iverson, and Kim's (1993) mirror effect. The mirror effect refers to an often-observed pattern in which manipulations that increase sensitivity do so by both increasing hit rates and decreasing false alarms. Consider the case of unbiased responding shown in Figure 2. In the figure, the criterion is half of the value of the sensitivity. If a manipulation were to increase sensitivity by increasing the hit rate and decreasing the false alarm rate in equal increments, then criterion must increase half as much as the sensitivity gain (like sensitivity, criterion is measured from 0, the mean of the new-item distribution). In Equations 8 and 9, the particular effect of a participant or item is balanced in hit and false alarm rates: If a particular participant or item is associated with high sensitivity, the effect is equally strong in hit and false alarm rates.

Because α_i and β_j denote the effects from randomly selected participants and items, it is reasonable to model them as random effects. These are given by

α_i ~ Normal(0, σ₁²)  (10)

and

β_j ~ Normal(0, σ₂²),  (11)

independently across participants and items, respectively.

We simulated data in a manner similar to the previous simulation. First, item and participant effects were sampled according to Equations 10 and 11. Then these random effects were combined according to Equations 8 and 9 in order to produce underlying sensitivities and criteria for each participant–item combination. On the basis of these true values, hypothetical responses were produced in accordance with the signal detection model shown in Figure 2. These hypothetical responses are dichotomous: A specific participant judges a specific item as either old or new. These responses were aggregated across items to produce hit and false alarm rates for each individual. From these aggregated rates, d′ and c were computed for each participant. Because individualized estimates of d′ were computed, variability across participants was not problematic (σ₁ was set equal to .5). The left panel of Figure 3 shows the results of a simulation with no item effects (σ₂ = 0). As expected, d′ estimates appear unbiased. The right panel shows a case with item variability (σ₂ = 1.5). The estimates systematically underestimate true values. This underestimation is an asymptotic bias; it does not diminish as the numbers of participants and items increase. We implemented variability in the context of a mirror effect, but the underestimation is also obtained in other implementations. Simulations with item variability exclusively in hits result in the same underestimation. Likewise, simulations with item variability in c alone result in the same underestimation (see Wickelgren, 1968, who makes a comparable argument).

The presence of asymptotic bias is disconcerting. Unlike analysis with ANOVA, replication does not guarantee correct inference. In the case of signal detection, one cannot tell whether a difference in overall estimated sensitivity between two conditions is due to a true sensitivity difference or to an increase in unmodeled variability in one condition relative to the other. One domain in which asymptotic underestimation is particularly pernicious


Figure 2. The signal detection model. Hit and false alarm probabilities are the areas of the old-item and new-item familiarity distributions that are greater than the criterion, respectively. Values of sensitivity (d′) and criterion (c) are measured from 0, the mean of the new-item familiarity distribution.


is signal detection analyses of subliminal semantic activation. Greenwald, Draine, and Abrams (1996), for example, used signal detection sensitivity aggregated across items to claim that a set of word primes was subliminal (d′ = 0). These subliminal primes then affect response times to subsequent stimuli. The asymptotic downward bias from item variability, however, renders suspect the claim that the primes were truly subliminal.

The presence of asymptotic bias with unmodeled variability is not unique to signal detection. In general, unmodeled variability leads to asymptotic bias in nonlinear models. Psychologists have long known about problems associated with aggregation in specific contexts. For example, Estes (1956) warned about aggregating learning curves across participants. Aggregate learning curves may be more graded than those of individuals, leading researchers to misdiagnose underlying mechanisms (Haider & Frensch, 2002; Heathcote, Brown, & Mewhort, 2000). Curran and Hintzman (1995) critiqued aggregation in Jacoby's (1991) process dissociation procedure. They showed that aggregating responses over participants, items, or both possibly leads to asymptotic underestimation of automaticity estimates. Ashby, Maddox, and Lee (1994) showed that aggregating data across participants possibly distorts estimates from similarity choice-based scaling (such as those from Gilmore, Hersh, Caramazza, & Griffin, 1979). Each of the examples above is based on the same general problem: Unmodeled variability distorts estimates in a nonlinear context. In all of these cases, the distortion does not diminish with increasing sample sizes. Unfortunately, the field has been slow to grasp the general nature of this problem. The critiques above were made in isolation and appeared as specific critiques of specific models. Instead, we see them as different instances of the negative effects of aggregating data in analyses with nonlinear models.

The solution is to model both participant and item variability simultaneously. We advocate Bayesian hierarchical models for this purpose (Rouder, Lu, Speckman, Sun, & Jiang, 2005; Rouder, Sun, Speckman, Lu, & Zhou, 2003). In a hierarchical model, variability on several levels, such as from the selection of items and participants as well as from measurement error, is modeled simultaneously. As was mentioned previously, hierarchical models have been used in linear contexts to improve power and to better control Type I error rates (see, e.g., Baayen et al., 2002; Kreft & de Leeuw, 1998). Although better statistical control is attractive to experimentalists, it is not critical. Instead, it is often easier (and wiser) to replicate results than to learn and implement complex statistical methodologies. For nonlinear models, however, hierarchical models are critical because bias is asymptotic.

The main drawback to nonlinear hierarchical models is tractability: They are difficult to implement. Starting in the 1980s and 1990s, however, new computational techniques based on Bayesian analysis emerged in statistics. The utility of these new techniques is immense; they allow for the analysis of previously intractable models, including nonlinear hierarchical ones (Gelfand & Smith, 1990; Gelman, Carlin, Stern, & Rubin, 2004; Gill, 2002; Tanner, 1996). These new Bayesian-based techniques have had tremendous impact in many quantitatively oriented disciplines, including statistics, engineering, bioinformatics, and economics.

The goal of this article is to provide an introduction to Bayesian hierarchical modeling. The focus is on a relatively new computational technique that has made

[Figure 3 appears here: left panel, "No Item Variability"; right panel, "Item Variability"; x-axes, True Sensitivity; y-axes, Estimated Sensitivity.]
Figure 3. The effect of unmodeled item variability (σ₂) on the estimation of sensitivity (d′). True values of d′ are the means of d′_ij across items. The left and right panels, respectively, show the results without and with item variability (σ₂ = 1.5). For both plots, the value of σ₁ was .5.


Bayesian analysis more tractable: Markov chain Monte Carlo sampling. In the first section, basic issues of estimation are discussed. Then this discussion is expanded to include the basics of Bayesian estimation. Following that, Markov chain Monte Carlo integration is introduced. Finally, a series of hierarchical signal detection models is presented. These models provide unbiased and relatively efficient sensitivity estimates for the standard equal-variance signal detection model, even in the presence of both participant and item variability.

ESTIMATION

It is easier to discuss Bayesian estimation in the context of a simple example. Consider the case in which a single participant performs N trials. The result of each trial is either a success or a failure. We wish to estimate the underlying true probability of a success, denoted p, for this single participant. Estimating a probability of success is essential to many applications in cognitive psychology, including signal detection analysis. In signal detection, we typically estimate hit and false alarm rates: Hits are successes on old-item trials, and false alarms are failures on new-item trials.

A formula that provides an estimate is termed an estimator. A reasonable estimator for this case is

p̂₀ = y/N,  (12)

where y is the number of successes out of N trials. Estimator p̂₀ is natural and straightforward.

To evaluate the usefulness of estimators, statisticians usually discuss three basic properties: bias, efficiency, and consistency. Bias and efficiency are illustrated in Table 1. The data are the results of weighing a hypothetical person of 170 lb on two hypothetical scales four separate times. Bias refers to the mean of repeated estimates. Scale A is unbiased because the mean of the estimates equals the true value of 170 lb. Scale B is biased: The mean is 172 lb, which is 2 lb greater than the true value of 170 lb. Scale B, however, has a smaller degree of error than does Scale A, so Scale B is termed more efficient than Scale A. Efficiency is the inverse of the expected error of an observation and may be indexed by the reciprocal of root mean squared error. Bias and efficiency have the same meaning for estimators as they do for scales. Bias refers to the difference between the average value of an estimator and a true value. Efficiency refers to the average magnitude of the difference between an estimator and a true value. In many situations, efficiency determines the quality of inference more than bias does.

How biased and efficient is estimator p̂₀? To provide a context for evaluation, consider the following two alternative estimators:

p̂₁ = (y + .5)/(N + 1)  (13)

and

p̂₂ = (y + 1)/(N + 2).  (14)

These two alternatives may seem unprincipled, but, as is discussed in the next section, they are justified in a Bayesian framework.

Figure 4 shows sampling distributions for the threeprobability estimators when the true probability is p 7and there are N 10 trials Estimator p0 is not biasedbut estimators p1 and p2 are Surprisingly estimator p2has the lowest average error that is it is the most effi-cient The reason is that the tails of the sampling distri-bution are closer to the true value of p 7 for p2 thanfor p1 or p0 Figure 4 shows the case for a single truevalue of p 7 Figure 5 shows bias and efficiency forall three estimators for the full range of p The conven-tional estimator p0 is unbiased for all true values of pbut the other two estimators are biased for extreme prob-abilities None of the estimators is always more efficientthan the others For intermediate probabilities estimatorp2 is most efficient for extreme probabilities estimatorp0 is most efficient Typically researchers have someidea of what type of probability of success to expect intheir experiments This knowledge can therefore be usedto help pick the best estimator for a particular situation
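The bias and RMSE comparisons of this kind can be approximated by direct simulation. The following R sketch, with an arbitrary simulation size, illustrates the comparison at p = .7 and N = 10.

# Approximate bias and RMSE of the three estimators at p = .7, N = 10
p <- 0.7; N <- 10
y <- rbinom(100000, N, p)                     # simulated success counts
est <- list(p0 = y/N,
            p1 = (y + 0.5)/(N + 1),
            p2 = (y + 1)/(N + 2))
sapply(est, function(e) c(bias = mean(e) - p,
                          RMSE = sqrt(mean((e - p)^2))))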

The other property of estimators is consistency Aconsistent estimator converges to its true value As thesample size is increased the estimator not only becomesunbiased the overall variance shrinks toward 0 Consis-tency should be viewed as a necessary property of a goodestimator Fortunately many estimators used in psychol-ogy are consistent including sample means sample vari-ances and sample correlation coefficients The three es-timators p0 p1 and p2 are consistent with sufficientdata they converge to the true value of p In fact all es-timators presented in this article are consistent

BAYESIAN ESTIMATION OF A PROBABILITY

In this section, we provide a Bayesian approach to estimating the probability of success. The datum in this application is the number of successes (y), and this quantity may be modeled as a binomial:

y ~ Binomial(N, p),

where N is known and p is the unknown parameter of interest.


Table 1
Weight of a 170-Pound Person Measured on Two Hypothetical Scales

            Scale A               Scale B
Data (lb)   180, 160, 175, 165    174, 170, 173, 171
Mean        170                   172
Bias        0                     2.0
RMSE        7.91                  2.55


Bayes's (1763) insightful contribution was to use the law of conditional probability to estimate the parameter p. He noted that

f(p | y) = f(y | p) f(p) / f(y).  (15)

We describe each of these terms in order.

The main goal is to find f(p | y), the quantity on the left-hand side of Equation 15. The term f(p | y) is the distribution of the parameter p given the data y. This distribution is referred to as the posterior distribution. It is viewed as a function of the parameter p. The mean of the posterior distribution, termed the posterior mean, serves as a suitable point estimator for p.

The term f(y | p) plays two roles in statistics, depending on context. When it is viewed as a function of y for known p, it is known as the probability mass function. The probability mass function describes the probability of any outcome y given a particular value of p. For a binomial random variable, the probability mass function is given by

f(y | p) = (N choose y) p^y (1 − p)^(N−y), y = 0, 1, …, N, and 0 otherwise.  (16)

For example, this function can be used to compute the probability of observing a total of two successes on three trials when p = .5:

f(y = 2 | p = .5) = (3 choose 2) (.5)² (1 − .5)¹ = 3/8.

The term f(y | p) can also be viewed as a function of parameter p for fixed y. In this case, it is termed the likelihood function and describes the likelihood of a particular parameter value p given a fixed value of y. For the binomial, the likelihood function is given by

f(y | p) = (N choose y) p^y (1 − p)^(N−y), 0 ≤ p ≤ 1.  (17)

In Bayes's theorem, we are interested in the posterior as a function of the parameter p. Hence, the right-hand terms of Equation 15 are viewed as a function of parameter p. Therefore, f(y | p) is viewed as the likelihood of p rather than the probability mass of y.

The term f(p) is the prior distribution and reflects the experimenter's a priori beliefs about the true value of p.

Finally, the term f(y) is the distribution of the data given the model. Although its interpretation is important in some contexts, it plays a minimal role in the


Figure 4. Sampling distributions of p̂₀, p̂₁, and p̂₂ for N = 10 trials with a p = .7 true probability of success on any one trial. Bias and root mean squared error (RMSE) are included.


development presented here, for the following reason: The posterior is a function of the parameter of interest p, but f(y) does not depend on p. The term f(y) is a constant with respect to p and serves as a normalizing constant that ensures that the posterior density integrates to 1. As will be shown in the examples, the value of this normalizing constant usually becomes apparent in analysis.

To perform Bayesian analysis, the researcher must choose a prior distribution. For this application of estimating a probability from binomially distributed data, the beta distribution serves as a suitable prior distribution for p. The beta is a flexible two-parameter distribution; see Figure 6 for examples. In contrast to the normal distribution, which has nonzero density on all numbers, the beta has nonzero density between 0 and 1. The beta distribution is a function of two parameters, a and b, that determine its shape. The notation for specifying a beta distribution prior for p is

p ~ Beta(a, b).

The density of a beta random variable is given by

f(p) = p^(a−1) (1 − p)^(b−1) / Be(a, b).  (18)

The denominator, Be(a, b), is termed the beta function.1 Like f(y), it is not a function of p and plays the role of a normalizing constant in analysis.

In practice, researchers must completely specify the prior distribution before analysis; that is, they must choose suitable values for a and b. This choice reflects researchers' beliefs about the possible values of p before data are collected. When a = 1 and b = 1 (see the middle panel of Figure 6), the beta distribution is flat; that is, there is equal density for all values in the interval [0, 1]. By choosing a = b = 1, the researcher is committing to all values of p being equally likely before data are collected. Researchers can choose values of a and b to best match their beliefs about the expected data. For example, consider a psychophysical experiment in which the value of p will almost surely be greater than .5. For this experiment, priors with a > b are appropriate.

The goal is to derive the posterior distribution using Bayes's theorem (Equation 15). Substituting the likelihood function (Equation 17) and prior (Equation 18) into Bayes's theorem yields

f(p | y) = [(N choose y) p^y (1 − p)^(N−y)] × [p^(a−1) (1 − p)^(b−1) / Be(a, b)] / f(y).

Collecting terms in p yields

f(p | y) = k p^(y+a−1) (1 − p)^(N−y+b−1),  (19)

where

k = (N choose y) [Be(a, b) f(y)]⁻¹.

The term k is constant with respect to p and serves as a normalizing constant. Substituting

a′ = y + a, b′ = N − y + b  (20)

into Equation 19 yields

f(p | y) = k p^(a′−1) (1 − p)^(b′−1).

This equation is proportional to a beta density for parameters a′ and b′. To make f(p | y) a proper probability density, k has a value such that f(p | y) integrates to 1.


Figure 5. Bias and root mean squared error (RMSE) for the three estimators as functions of true probability of success. The solid, dashed, and dashed–dotted lines denote the characteristics of p̂₀, p̂₁, and p̂₂, respectively.

Figure 6. Probability density function of the beta distribution for various values of parameters a and b.


This occurs only if the value of k is a beta function, for example, k = 1/Be(a′, b′). Consequently,

f(p | y) = p^(a′−1) (1 − p)^(b′−1) / Be(a′, b′).  (21)

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by separating those terms that depend on the parameter of interest from those that do not. The latter terms serve as a normalizing constant and can be computed by ensuring that the posterior integrates to 1.0. In most derivations, the value of the normalizing constant becomes apparent, much as it did above. Consequently, it is convenient to consider only those terms that depend on the parameter of interest and to lump the rest into a proportionality constant. Hence, a convenient form of Bayes's theorem is

Posterior distribution ∝ Likelihood function × Prior distribution.

The symbol "∝" denotes proportionality and is read is proportional to. This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1, and the observed data are 7 successes in 10 trials. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably more narrow than the prior, indicating that a large degree of information has been gained from the data. There are two valuable quantities derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.
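Because the posterior is a beta with known parameters, these quantities are easy to compute. A small R sketch follows; it uses the central 95% interval as a simple stand-in for the highest density region.

# Posterior for y = 7 successes in N = 10 trials under a flat Beta(1, 1) prior
a <- 1; b <- 1; y <- 7; N <- 10
a.post <- y + a; b.post <- N - y + b    # Equation 20
a.post/(a.post + b.post)                # posterior mean, about .67
qbeta(c(.025, .975), a.post, b.post)    # central 95% credible interval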

For the case of a beta-distributed posterior, the expression for the posterior mean is

p̂ = a′/(a′ + b′) = (y + a)/(N + a + b).

The posterior mean reduces to p̂₁ if a = b = .5; it reduces to p̂₂ if a = b = 1. Therefore, estimators p̂₁ and p̂₂ are theoretically justified from a Bayesian perspective. The estimator p̂₀ may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper. Conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper. In some cases, an improper prior will still yield a proper posterior. When this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.

When the prior and posterior are from the same distribution, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data because the resulting posterior is also a beta distribution. Conjugate priors are convenient; they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion about conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model: We will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations w = (w₁, …, w_N) are independent and identically distributed as normals:

w_j ~ Normal(μ, σ²).

The goal is to estimate parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²

Assume that σ² is known and that the estimation of μ is of primary interest. The goal, then, is to derive the


Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


posterior f(μ | σ², w). According to the proportional form of Bayes's theorem,

f(μ | σ², w) ∝ f(w | μ, σ²) f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

μ ~ Normal(μ₀, σ₀²).

Parameters μ₀ and σ₀² must be specified before analysis, according to the researcher's beliefs about μ. By choosing σ₀² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ.

The density of the prior is given by

f(μ) = (2πσ₀²)^(−1/2) exp[−(μ − μ₀)²/(2σ₀²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

f(w | μ, σ²) = (2πσ²)^(−N/2) exp[−Σ_j (w_j − μ)²/(2σ²)].  (22)

Multiplying the equation above by the prior yields

f(μ | σ², w) ∝ exp{−[Σ_j (w_j − μ)²/(2σ²) + (μ − μ₀)²/(2σ₀²)]}.

We concern ourselves with only those terms that involve μ, the parameter of interest. Other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

f(μ | σ², w) ∝ exp[−(μ − μ′)²/(2σ²′)],

where

σ²′ = (N/σ² + 1/σ₀²)⁻¹  (23)

and

μ′ = σ²′ (Σ_j w_j/σ² + μ₀/σ₀²).  (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ²′, respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ²′:

μ | σ², w ~ Normal(μ′, σ²′).  (25)

Unfortunately, the notation does not make it clear that both μ′ and σ²′ depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ²′(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ²′ = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior



is constructed to be fairly informative, with μ₀ = 100 and σ₀² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean, but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.

Estimating σ² With Known μ

In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

f(σ²) = [b^a / Γ(a)] (σ²)^(−(a+1)) exp(−b/σ²), σ² ≥ 0.

The parameters of the inverse gamma are a and b, and in application, these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

f(σ² | μ, w) ∝ (σ²)^(−N/2) exp[−Σ_j (w_j − μ)²/(2σ²)] × (σ²)^(−(a+1)) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

f(σ² | μ, w) ∝ (σ²)^(−(a′+1)) exp(−b′/σ²),

where

a′ = N/2 + a  (26)

and

b′ = Σ_j (w_j − μ)²/2 + b.  (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

σ² | μ, w ~ Inverse Gamma(a′, b′).  (28)

Note that posterior parameter b′ depends explicitly on the value of μ.

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem: The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors μ | w and σ² | w. The


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


straightforward method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis–Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis: With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]₁ denotes the first sample, [η]₂ denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]₁ ~ Normal(μ + [η]₁, σ²)

and

[x]₂ ~ Normal(μ + [η]₂, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
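The following R sketch reproduces this Monte Carlo integration with illustrative parameter values.

mu <- 400; sigma <- 40; tau <- 100   # illustrative ex-Gaussian parameters
eta <- rexp(100000, rate = 1/tau)    # samples of the exponential component
x <- rnorm(100000, mean = mu + eta, sd = sigma)  # x | eta ~ Normal(mu + eta, sigma^2)
hist(x, breaks = 100, freq = FALSE)  # histogram approximates the ex-Gaussian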

We now return to the problem of deriving the marginal posterior distributions of μ | w from the conditional μ | σ², w. If there was a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there was a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows: First, a value of [σ²]₁ is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]₁ ~ Normal(μ′([σ²]₁), σ²′([σ²]₁)),

where μ′ and σ²′ are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]₁ is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]_m ~ Normal(μ′([σ²]_(m−1)), σ²′([σ²]_(m−1)))  (29)

and

[σ²]_m ~ Inverse Gamma(a′, b′([μ]_m)).  (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.
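A compact R version of this Gibbs sampler, assuming the diffuse prior settings used in the IQ example below, is sketched here.

w <- c(112, 106, 104, 111); N <- length(w)
mu0 <- 0; sigma02 <- 1e6; a <- 1e-6; b <- 1e-6   # diffuse priors
M <- 10000
mu <- numeric(M); sigma2 <- numeric(M); sigma2[1] <- 1   # arbitrary start
for (m in 2:M) {
  s2 <- 1/(N/sigma2[m-1] + 1/sigma02)                      # Equation 23
  mu[m] <- rnorm(1, s2*(sum(w)/sigma2[m-1] + mu0/sigma02), sqrt(s2))  # Eq. 29
  b.post <- sum((w - mu[m])^2)/2 + b                       # Equation 27
  sigma2[m] <- 1/rgamma(1, shape = N/2 + a, rate = b.post) # Equation 30
}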

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]₁ and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994). The initial region in


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m₀ denote the point at which the samples are approximately steady state. Samples after m₀ can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σ_(m>m₀) [μ]_m / (M − m₀)

and

σ̂² = Σ_(m>m₀) [σ²]_m / (M − m₀),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m₀ to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution. Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (s_w̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ₀ = 0, σ₀² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value for [σ²]₁. As can be seen, both chains converge to their steady state rather quickly. For this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous



Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]₁ = 10,000 and [σ²]₁ = .0001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


knowledge. In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ₀ = 100 and σ₀² = 15² = 225. We may not know the particular test–retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test–retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]₁ = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows

that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another. The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly, and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that for the normal application there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
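In R, the autocorrelation function of a chain can be plotted directly; continuing the Gibbs sampling sketch above:

acf(mu[keep])  # autocorrelation of the retained samples of mu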

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution prior is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z) and, conversely, z = Φ⁻¹(p).
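In R, the probit transform and its inverse are available as qnorm and pnorm; for example:

qnorm(0.69)   # probit transform: z = qnorm(.69), about 0.5
pnorm(0.5)    # inverse transform: p = pnorm(0.5), about .69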

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹[p(h)] − Φ⁻¹[p(f)]

and

c = −Φ⁻¹[p(f)],

where p(h) and p(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest.

When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ₀, σ₀²).  (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case when a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior f(z | Y) for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) exp[−(z − μ₀)²/(2σ₀²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ₀ = 0 and σ₀² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with density a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]₁, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = .5.


1,000 samples with 100 samples of burn-in are more than sufficient for this application.

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i/N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified, but instead free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success. The ith


Figure 13. Three normally distributed priors on z [N(0, 1), N(1, 1), and N(0, 2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal; priors on p were obtained by transforming each sample of z.


participant's true ability z_i is simply a sample from this normal. The number of successes is modeled as

y_i | z_i ~ Binomial[N_i, Φ(z_i)],  (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ Normal(μ, σ²), independently across participants.  (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ_0, σ_0²)  (34)

and

σ² ~ Inverse Gamma(a, b).  (35)

Values of μ_0, σ_0², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities (see note 2), which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂_0 (Equation 12) and p̂_2 (Equation 14) are shown.
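In R, this data generation takes two lines (the beta parent parameters are those given in note 2; cf. Appendix C):

I=20; J=50
p.true=rbeta(I,3.5,1.5) #each individual's true probability of success (note 2)
y=rbinom(I,J,p.true)    #observed numbers of successes in 50 trials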

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ_0 = 0, σ_0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16; the autocorrelation in these plots was typical of the other parameters and no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂_0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart p̂_0. The top right panel shows the hierarchical estimates as a function of p̂_0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.
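The root mean squared errors in Table 2 can be computed along these lines; here p.true and y are from the generation sketch above, and pesth holds the hierarchical posterior means produced by the Appendix C code:

rmse=function(est,truth) sqrt(mean((est-truth)^2))
rmse(y/J,p.true)   #conventional estimator p0
rmse(pesth,p.true) #hierarchical posterior means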

Estimator p̂_2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂_0. For the data of Table 2, the difference in efficiency between p̂_2 and p̂_0 is exceedingly small (about .0001). The reason that there was little gain for p̂_2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂_2 and p̂_0 have about the same efficiency. The hierarchical estimator is


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal trials and noise trials are, respectively, identical replicates.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ Normal(μ_h, σ_h²)  (36)

and

f_i ~ Normal(μ_f, σ_f²).  (37)

Second-stage priors are placed on the parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k), k = h, f,  (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k), k = h, f.  (39)


Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p̂_0  p̂_2   Hierarchical
 1           .587        33    .66  .654  .672
 2           .621        28    .56  .558  .615
 3           .633        34    .68  .673  .683
 4           .650        29    .58  .577  .627
 5           .659        32    .64  .635  .660
 6           .665        37    .74  .731  .716
 7           .667        30    .60  .596  .638
 8           .667        29    .58  .577  .627
 9           .687        34    .68  .673  .684
10           .693        34    .68  .673  .682
11           .696        34    .68  .673  .683
12           .701        37    .74  .731  .715
13           .713        38    .76  .750  .728
14           .734        38    .76  .750  .727
15           .736        37    .74  .731  .717
16           .738        33    .66  .654  .672
17           .743        37    .74  .731  .716
18           .759        33    .66  .654  .672
19           .786        38    .76  .750  .728
20           .803        42    .84  .827  .771

RMSE                           .053  .053  .041

Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples of h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished: For all samples m,

[d′_i]_m = [h_i]_m − [f_i]_m  (40)

and

[c_i]_m = −[f_i]_m.  (41)
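In code, this second step is a pair of vectorized operations. The sketch below assumes h.samp and f.samp are M x I matrices of posterior samples from the two runs (these names are ours, not the article's):

dprime.samp=h.samp-f.samp        #Equation 40, applied to every sample at once
c.samp=-f.samp                   #Equation 41
dprime.est=colMeans(dprime.samp) #posterior mean sensitivity for each participant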

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).
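With hits and fa as vectors of hit and false alarm counts (names assumed for illustration), this conventional estimate in R is a one-liner; qnorm is Φ⁻¹:

dprime.conv=qnorm(hits/50)-qnorm(fa/50) #conventional estimate per participant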

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = .01, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information; the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely



Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂_0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂_0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants (Participants 5 and 15) from the hierarchical model. The degree of autocorrelation is typical of the other participants.


reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses in which scores are aggregated over items result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was "old" or "new." Let p_ij^(h) and p_ij^(f) be the true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) | h_ij ~ Bernoulli[Φ(h_ij)]  (42)

and

y_ij^(f) | f_ij ~ Bernoulli[Φ(f_ij)].  (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j  (44)

and

f_ij = μ_f + β_i + δ_j.  (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f  (46)

and

c = −μ_f.  (47)


Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146          .880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580

RMSE                                       .183          .275

Note. Data are numbers of "signal" responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i  (48)

and

c_i = −(μ_f + β_i).  (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j  (50)

and

c_j = −(μ_f + δ_j).  (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ Normal(0, σ_α²),  (52)

β_i ~ Normal(0, σ_β²),  (53)

γ_j ~ Normal(0, σ_γ²),  (54)

and

δ_j ~ Normal(0, σ_δ²).  (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants (Participants 5 and 15).


effect is not assumed in model analysis. Participant and item effects on hits and false alarms are estimated independently.
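For concreteness, mirror-effect data of this sort might be generated as follows. Setting β_i = −α_i and δ_j = −γ_j is our illustration of the mirror principle, not necessarily the article's exact simulation settings:

I=20; J=50
mu.h=1; mu.f=-1
alpha=rnorm(I,0,.3); gamma=rnorm(J,0,.3) #participant and item effects on hits
h=mu.h+outer(alpha,gamma,"+")            #h_ij = mu_h + alpha_i + gamma_j (Equation 44)
f=mu.f-outer(alpha,gamma,"+")            #mirror effect: beta_i = -alpha_i, delta_j = -gamma_j
y.h=matrix(rbinom(I*J,1,pnorm(h)),I,J)   #old-item responses (Equation 42)
y.f=matrix(rbinom(I*J,1,pnorm(f)),I,J)   #new-item responses (Equation 43)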

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple (50 in the example above) is not completely arbitrary, because it should reflect the amount of autocorrelation: Chains with larger amounts of autocorrelation implicate larger multiples for thinning.
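Thinning itself is a one-liner. In the sketch below, chain is a long post-convergence vector of samples (the name and the multiple of 50 are illustrative):

kept=chain[seq(from=101,to=length(chain),by=50)] #burn-in of 100, then every 50th sample
acf(kept) #autocorrelation of the thinned chain should be near zero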

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by the conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_(m−1) influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_(m−1) influences the value of [μ_h]_m; that is, successive samples are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution.

Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ)′. The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
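The Cholesky route mentioned above can be sketched in a few lines of R (the function name and the covariance values are ours, for illustration only):

rmvnorm.chol=function(m,S){    #one draw from a multivariate normal, mean m, covariance S
  L=t(chol(S))                 #chol() returns upper-triangular R with t(R)%*%R = S
  drop(m+L%*%rnorm(length(m))) #shift and correlate independent N(0,1) draws
}
S=matrix(c(1.0,0.3,0.2,
           0.3,1.0,0.1,
           0.2,0.1,1.0),nrow=3) #an assumed covariance for a block such as (mu_h, alpha_1, gamma_1)
theta=rmvnorm.chol(rep(0,3),S)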

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of 27% in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed overshrinkage. We suspect overshrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that overshrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic



improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant; the y-axis value of the point is α_i, the participant's hit rate effect, and the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward overshrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some overshrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
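For intuition, the sketch below draws zero-centered participant effects from such a bivariate normal; the negative off-diagonal element encodes a mirror effect (the numerical values are ours, purely for illustration):

Sigma.part=matrix(c(0.5,-0.3,
                    -0.3,0.5),2,2)          #negative covariance: a mirror effect
L=t(chol(Sigma.part))
effects=t(replicate(20,drop(L%*%rnorm(2)))) #rows hold (alpha_i, beta_i) for 20 participants
cor(effects[,1],effects[,2])                #near -.6, the correlation implied by Sigma.part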

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to overshrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of the old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of the hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior; increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent; the value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psycho-


logical literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.

NOTES

1. Be(a, b) = Γ(a)Γ(b) / Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.
2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0, and x_j = 0 if w_j < 0.  (A1)

Latent variables w are never known; all that is known is their sign. If a trial j was successful, w_j was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (see the note to this appendix). Therefore,

w_j ~ Normal(z, 1), independently across trials.

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ_0²)⁻¹ (N w̄ + μ_0/σ_0²)  (A2)

and

σ′² = (N + 1/σ_0²)⁻¹.  (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative; hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.




Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with a variance of 1.0, so the term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.

APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0, and x_ij = 0 if w_ij < 0.  (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.




APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-PLUS. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# -------------------------------------------
# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a Normal(b, 1) truncated below zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a Normal(b, 1) truncated above zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# generate true values and data
# -----------------------------
I = 20   # number of participants
J = 50   # number of trials per participant
# generate each individual's true p
p = rbeta(I, 35, 15)
# result, use scan() to enter:
# .6932272 .6956900 .6330869 .6504598 .7008421 .6589233 .7337305
# .6666958 .5867875 .7132572 .7863823 .6869082 .6665865 .7426910
# .6649086 .6212024 .7375715 .8025261 .7587382 .7363251
# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result, use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# analysis setup
# ---------------
M = 10000         # total number of MCMC iterations
myrange = 101:M   # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z = matrix(ncol = I, nrow = M)   # each person's z at each iteration
                                 # note: z is the probit of p, i.e., z = qnorm(p)
sigma2 = 1:M
mu = 1:M
# prior parameters
sigma0 = 100   # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1
# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)
# shape for the inverse gamma, calculated outside the loop for speed
shape = I/2 + a

# --------------
# main MCMC loop
# --------------
for (m in 2:M) {      # iteration
  for (i in 1:I) {    # participant
    # samples of w | y, z
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))
    prec = J + 1/sigma2[m-1]
    meanz = (sum(w) + mu[m-1]/sigma2[m-1]) / prec
    z[m, i] = rnorm(1, meanz, sqrt(1/prec))   # sample of z | w, mu, sigma2
  }
  prec = I/sigma2[m-1] + 1/sigma0
  meanmu = (sum(z[m, ])/sigma2[m-1] + mu0/sigma0) / prec
  mu[m] = rnorm(1, meanmu, sqrt(1/prec))      # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2)/2 + b
  sigma2[m] = 1/rgamma(1, shape = shape, scale = 1/rate)   # sample of sigma2 | z
}

# -----------------------------------------
# gather estimates of p for each individual
allpest = pnorm(z[myrange, ])    # MCMC chains of p
pesth = apply(allpest, 2, mean)  # posterior means
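A quick check of the sampler's output, assuming the objects defined above are in the workspace:

plot(p, pesth)  # posterior means against true probabilities
abline(0, 1)    # estimates should cluster near the diagonal, pulled toward the parent mean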

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)


cious is signal detection analyses of subliminal semantic activation. Greenwald, Draine, and Abrams (1996), for example, used signal detection sensitivity aggregated across items to claim that a set of word primes was subliminal (d′ = 0). These subliminal primes then affect response times to subsequent stimuli. The asymptotic downward bias from item variability, however, renders suspect the claim that the primes were truly subliminal.

The presence of asymptotic bias with unmodeled variability is not unique to signal detection. In general, unmodeled variability leads to asymptotic bias in nonlinear models. Psychologists have long known about problems associated with aggregation in specific contexts. For example, Estes (1956) warned about aggregating learning curves across participants. Aggregate learning curves may be more graded than those of individuals, leading researchers to misdiagnose underlying mechanisms (Haider & Frensch, 2002; Heathcote, Brown, & Mewhort, 2000). Curran and Hintzman (1995) critiqued aggregation in Jacoby's (1991) process dissociation procedure. They showed that aggregating responses over participants, items, or both possibly leads to asymptotic underestimation of automaticity estimates. Ashby, Maddox, and Lee (1994) showed that aggregating data across participants possibly distorts estimates from similarity choice-based scaling (such as those from Gilmore, Hersh, Caramazza, & Griffin, 1979). Each of the examples above is based on the same general problem: Unmodeled variability distorts estimates in a nonlinear context. In all of these cases, the distortion does not diminish with increasing sample sizes. Unfortunately, the field has been slow to grasp the general nature of this problem. The critiques above were made in isolation and appeared as specific critiques of specific models. Instead, we see them as different instances of the negative effects of aggregating data in analyses with nonlinear models.

The solution is to model both participant and item variability simultaneously. We advocate Bayesian hierarchical models for this purpose (Rouder, Lu, Speckman, Sun, & Jiang, 2005; Rouder, Sun, Speckman, Lu, & Zhou, 2003). In a hierarchical model, variability on several levels, such as from the selection of items and participants, as well as from measurement error, is modeled simultaneously. As was mentioned previously, hierarchical models have been used in linear contexts to improve power and to better control Type I error rates (see, e.g., Baayen et al., 2002; Kreft & de Leeuw, 1998). Although better statistical control is attractive to experimentalists, it is not critical. Instead, it is often easier (and wiser) to replicate results than to learn and implement complex statistical methodologies. For nonlinear models, however, hierarchical models are critical because bias is asymptotic.

The main drawback to nonlinear hierarchical models is tractability: They are difficult to implement. Starting in the 1980s and 1990s, however, new computational techniques based on Bayesian analysis emerged in statistics. The utility of these new techniques is immense; they allow for the analysis of previously intractable models, including nonlinear hierarchical ones (Gelfand & Smith, 1990; Gelman, Carlin, Stern, & Rubin, 2004; Gill, 2002; Tanner, 1996). These new Bayesian-based techniques have had tremendous impact in many quantitatively oriented disciplines, including statistics, engineering, bioinformatics, and economics.

The goal of this article is to provide an introduction to Bayesian hierarchical modeling. The focus is on a relatively new computational technique that has made

Figure 3. The effect of unmodeled item variability (σ2) on the estimation of sensitivity (d′). True values of d′ are the means of d′ij across items. The left and right panels (No Item Variability and Item Variability; axes are true sensitivity and estimated sensitivity) show the results without and with item variability (σ2 = 1.5). For both plots, the value of σ1 was .5.


Bayesian analysis more tractable: Markov chain Monte Carlo sampling. In the first section, basic issues of estimation are discussed. Then this discussion is expanded to include the basics of Bayesian estimation. Following that, Markov chain Monte Carlo integration is introduced. Finally, a series of hierarchical signal detection models is presented. These models provide unbiased and relatively efficient sensitivity estimates for the standard equal-variance signal detection model, even in the presence of both participant and item variability.

ESTIMATION

It is easier to discuss Bayesian estimation in the context of a simple example. Consider the case in which a single participant performs N trials. The result of each trial is either a success or a failure. We wish to estimate the underlying true probability of a success, denoted p, for this single participant. Estimating a probability of success is essential to many applications in cognitive psychology, including signal detection analysis. In signal detection, we typically estimate hit and false alarm rates. Hits are successes on old-item trials, and false alarms are failures on new-item trials.

A formula that provides an estimate is termed an estimator. A reasonable estimator for this case is

p̂0 = y/N, (12)

where y is the number of successes out of N trials. Estimator p̂0 is natural and straightforward.

To evaluate the usefulness of estimators, statisticians usually discuss three basic properties: bias, efficiency, and consistency. Bias and efficiency are illustrated in Table 1. The data are the results of weighing a hypothetical person of 170 lb on two hypothetical scales four separate times. Bias refers to the mean of repeated estimates. Scale A is unbiased because the mean of the estimates equals the true value of 170 lb. Scale B is biased: The mean is 172 lb, which is 2 lb greater than the true value of 170 lb. Scale B, however, has a smaller degree of error than does Scale A, so Scale B is termed more efficient than Scale A. Efficiency is the inverse of the expected error of an observation and may be indexed by the reciprocal of root mean squared error. Bias and efficiency have the same meaning for estimators as they do for scales: Bias refers to the difference between the average value of an estimator and a true value; efficiency refers to the average magnitude of the difference between an estimator and a true value. In many situations, efficiency determines the quality of inference more than bias does.

How biased and efficient is estimator p̂0? To provide a context for evaluation, consider the following two alternative estimators:

p̂1 = (y + .5)/(N + 1) (13)

and

p̂2 = (y + 1)/(N + 2). (14)

These two alternatives may seem unprincipled, but as is discussed in the next section, they are justified in a Bayesian framework.

Figure 4 shows sampling distributions for the threeprobability estimators when the true probability is p 7and there are N 10 trials Estimator p0 is not biasedbut estimators p1 and p2 are Surprisingly estimator p2has the lowest average error that is it is the most effi-cient The reason is that the tails of the sampling distri-bution are closer to the true value of p 7 for p2 thanfor p1 or p0 Figure 4 shows the case for a single truevalue of p 7 Figure 5 shows bias and efficiency forall three estimators for the full range of p The conven-tional estimator p0 is unbiased for all true values of pbut the other two estimators are biased for extreme prob-abilities None of the estimators is always more efficientthan the others For intermediate probabilities estimatorp2 is most efficient for extreme probabilities estimatorp0 is most efficient Typically researchers have someidea of what type of probability of success to expect intheir experiments This knowledge can therefore be usedto help pick the best estimator for a particular situation
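These bias and efficiency properties are easy to verify by simulation; the following is a minimal R sketch for the case p = .7 and N = 10 (cf. Figure 4):

R = 100000; N = 10; p = .7
y = rbinom(R, N, p)   # simulated success counts
p0 = y/N; p1 = (y + .5)/(N + 1); p2 = (y + 1)/(N + 2)
c(mean(p0), mean(p1), mean(p2)) - p   # bias of each estimator
sqrt(c(mean((p0 - p)^2), mean((p1 - p)^2), mean((p2 - p)^2)))   # RMSE of each estimator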

The other property of estimators is consistency. A consistent estimator converges to its true value: As the sample size is increased, the estimator not only becomes unbiased, but its overall variance also shrinks toward 0. Consistency should be viewed as a necessary property of a good estimator. Fortunately, many estimators used in psychology are consistent, including sample means, sample variances, and sample correlation coefficients. The three estimators p̂0, p̂1, and p̂2 are consistent; with sufficient data, they converge to the true value of p. In fact, all estimators presented in this article are consistent.

BAYESIAN ESTIMATION OF A PROBABILITY

In this section, we provide a Bayesian approach to estimating the probability of success. The datum in this application is the number of successes (y), and this quantity may be modeled as a binomial:

y ~ Binomial(N, p),

where N is known and p is the unknown parameter of interest.


Table 1
Weight of a 170-Pound Person Measured on Two Hypothetical Scales

            Scale A              Scale B
Data (lb)   180 160 175 165      174 170 173 171
Mean        170                  172
Bias        0                    2.0
RMSE        7.91                 2.55


Bayes's (1763) insightful contribution was to use the law of conditional probability to estimate the parameter p. He noted that

f(p | y) = f(y | p) f(p) / f(y). (15)

We describe each of these terms in order.
The main goal is to find f(p | y), the quantity on the left-hand side of Equation 15. The term f(p | y) is the distribution of the parameter p given the data, y. This distribution is referred to as the posterior distribution. It is viewed as a function of the parameter p. The mean of the posterior distribution, termed the posterior mean, serves as a suitable point estimator for p.

The term f(y | p) plays two roles in statistics, depending on context. When it is viewed as a function of y for known p, it is known as the probability mass function. The probability mass function describes the probability of any outcome y given a particular value of p. For a binomial random variable, the probability mass function is given by

f(y | p) = (N choose y) p^y (1 − p)^(N−y),  y = 0, 1, . . . , N;  0 otherwise. (16)

For example, this function can be used to compute the probability of observing a total of two successes on three trials when p = .5:

f(y = 2 | p = .5) = (3 choose 2) (.5)² (1 − .5)¹ = 3/8.
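This value can be checked against R's built-in binomial probability mass function:

dbinom(2, 3, .5)   # 0.375, i.e., 3/8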

The term f(y | p) can also be viewed as a function of parameter p for fixed y. In this case, it is termed the likelihood function and describes the likelihood of a particular parameter value p given a fixed value of y. For the binomial, the likelihood function is given by

f(y | p) = (N choose y) p^y (1 − p)^(N−y),  0 ≤ p ≤ 1. (17)

In Bayes's theorem, we are interested in the posterior as a function of the parameter p. Hence, the right-hand terms of Equation 15 are viewed as a function of parameter p. Therefore, f(y | p) is viewed as the likelihood of p rather than the probability mass of y.

The term f(p) is the prior distribution and reflects the experimenter's a priori beliefs about the true value of p.

Finally, the term f(y) is the distribution of the data given the model. Although its interpretation is important in some contexts, it plays a minimal role in the development


Figure 4. Sampling distributions of p̂0, p̂1, and p̂2 for N = 10 trials with a p = .7 true probability of success on any one trial. Bias and root mean squared error (RMSE) are included.


presented here, for the following reason: The posterior is a function of the parameter of interest, p, but f(y) does not depend on p. The term f(y) is a constant with respect to p and serves as a normalizing constant that ensures that the posterior density integrates to 1. As will be shown in the examples, the value of this normalizing constant usually becomes apparent in analysis.

To perform Bayesian analysis, the researcher must choose a prior distribution. For this application of estimating a probability from binomially distributed data, the beta distribution serves as a suitable prior distribution for p. The beta is a flexible two-parameter distribution; see Figure 6 for examples. In contrast to the normal distribution, which has nonzero density on all numbers, the beta has nonzero density between 0 and 1. The beta distribution is a function of two parameters, a and b, that determine its shape. The notation for specifying a beta distribution prior for p is

p ~ Beta(a, b).

The density of a beta random variable is given by

f(p) = p^(a−1) (1 − p)^(b−1) / Be(a, b). (18)

The denominator, Be(a, b), is termed the beta function.¹ Like f(y), it is not a function of p and plays the role of a normalizing constant in analysis.

In practice, researchers must completely specify the prior distribution before analysis; that is, they must choose suitable values for a and b. This choice reflects researchers' beliefs about the possible values of p before data are collected. When a = 1 and b = 1 (see the middle panel of Figure 6), the beta distribution is flat; that is, there is equal density for all values in the interval [0, 1]. By choosing a = b = 1, the researcher is committing to all values of p being equally likely before data are collected. Researchers can choose values of a and b to best match their beliefs about the expected data. For example, consider a psychophysical experiment in which the value of p will almost surely be greater than .5. For this experiment, priors with a > b are appropriate.

The goal is to derive the posterior distribution using Bayes's theorem (Equation 15). Substituting the likelihood function (Equation 17) and prior (Equation 18) into Bayes's theorem yields

f(p | y) = [(N choose y) p^y (1 − p)^(N−y)] × [p^(a−1) (1 − p)^(b−1) / Be(a, b)] / f(y).

Collecting terms in p yields

f(p | y) = k p^(y+a−1) (1 − p)^(N−y+b−1), (19)

where

k = (N choose y) / [Be(a, b) f(y)].

The term k is constant with respect to p and serves as a normalizing constant. Substituting

a′ = y + a,  b′ = N − y + b (20)

into Equation 19 yields

f(p | y) = k p^(a′−1) (1 − p)^(b′−1).

This equation is proportional to a beta density for parameters a′ and b′. To make f(p | y) a proper probability density, k has a value such that f(p | y) integrates to 1.


Figure 5. Bias and root mean squared error (RMSE) for the three estimators as functions of true probability of success. The solid, dashed, and dashed–dotted lines denote the characteristics of p̂0, p̂1, and p̂2, respectively.

Figure 6. Probability density function of the beta distribution for various values of parameters a and b.


This occurs only if the value of k is a beta function, for example, k = 1/Be(a′, b′). Consequently,

f(p | y) = p^(a′−1) (1 − p)^(b′−1) / Be(a′, b′). (21)

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by separating those terms that depend on the parameter of interest from those that do not. The latter terms serve as a normalizing constant and can be computed by ensuring that the posterior integrates to 1.0. In most derivations, the value of the normalizing constant becomes apparent, much as it did above. Consequently, it is convenient to consider only those terms that depend on the parameter of interest and to lump the rest into a proportionality constant. Hence, a convenient form of Bayes's theorem is

Posterior distribution ∝ Likelihood function × Prior distribution.

The symbol "∝" denotes proportionality and is read "is proportional to." This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1, and the observed data are 7 successes in 10 trials. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably more narrow than the prior, indicating that a large degree of information has been gained from the data. There are two valuable quantities derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.
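For this example, the posterior mean and an equal-tail 95% interval (for this unimodal, moderately symmetric posterior, a close relative of the 95% highest density region) can be computed directly in R:

aprime = 8; bprime = 4                 # posterior parameters for y = 7, N = 10, flat prior
aprime / (aprime + bprime)             # posterior mean, 8/12 = .667
qbeta(c(.025, .975), aprime, bprime)   # central 95% interval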

For the case of a beta-distributed posterior, the expression for the posterior mean is

p̂ = a′/(a′ + b′) = (y + a)/(N + a + b).

The posterior mean reduces to p̂1 if a = b = .5; it reduces to p̂2 if a = b = 1. Therefore, estimators p̂1 and p̂2 are theoretically justified from a Bayesian perspective. The estimator p̂0 may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper; conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper. In some cases, an improper prior will still yield a proper posterior. When this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.

When the prior and posterior are from the same distribution, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data because the resulting posterior is also a beta distribution. Conjugate priors are convenient; they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion about conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model: We will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations, w = (w1, . . . , wN), are independent and identically distributed as normals:

wj ~ind Normal(μ, σ²).

The goal is to estimate parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²

Assume that σ² is known and that the estimation of μ is of primary interest. The goal then is to derive the posterior f(μ | σ², w).


Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


According to the proportional form of Bayes's theorem,

f(μ | σ², w) ∝ f(w | μ, σ²) f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

μ ~ Normal(μ0, σ0²).

Parameters μ0 and σ0² must be specified before analysis, according to the researcher's beliefs about μ. By choosing σ0² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ.

The density of the prior is given by

f(μ) = (2πσ0²)^(−1/2) exp[−(μ − μ0)² / (2σ0²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).
The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

f(w | μ, σ²) = (2πσ²)^(−N/2) exp[−Σj (wj − μ)² / (2σ²)]. (22)

Multiplying the equation above by the prior yields

f(μ | σ², w) ∝ (2πσ²)^(−N/2) exp[−Σj (wj − μ)² / (2σ²)] × (2πσ0²)^(−1/2) exp[−(μ − μ0)² / (2σ0²)].

We concern ourselves with only those terms that involve μ, the parameter of interest. Other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

f(μ | σ², w) ∝ exp[−(μ − μ′)² / (2σ²′)],

where

μ′ = σ²′ (Σj wj / σ² + μ0 / σ0²) (23)

and

σ²′ = (N/σ² + 1/σ0²)^(−1). (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ²′, respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ²′:

μ | σ², w ~ Normal(μ′, σ²′). (25)

Unfortunately, the notation does not make it clear that both μ′ and σ²′ depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ²′(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ²′ = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior


is constructed to be fairly informative, with μ0 = 100 and σ0² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean, but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.

Estimating σ² With Known μ
In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal then is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

f(σ²) = [b^a / Γ(a)] (σ²)^(−(a+1)) exp(−b/σ²),  σ² ≥ 0.

The parameters of the inverse gamma are a and b, and in application these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

f(σ² | μ, w) ∝ (2πσ²)^(−N/2) exp[−Σj (wj − μ)² / (2σ²)] × [b^a / Γ(a)] (σ²)^(−(a+1)) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

f(σ² | μ, w) ∝ (σ²)^(−(a′+1)) exp(−b′/σ²),

where

a′ = N/2 + a (26)

and

b′ = Σj (wj − μ)²/2 + b. (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

σ² | μ, w ~ Inverse Gamma(a′, b′). (28)

Note that posterior parameter b′ depends explicitly on the value of μ.

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem: The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors, μ | w and σ² | w.


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


The straightforward method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis–Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis. With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]1 denotes the first sample, [η]2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]1 ~ Normal(μ + [η]1, σ²)

and

[x]2 ~ Normal(μ + [η]2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
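In R, this conditional sampling scheme takes only a few lines; the parameter values below are arbitrary illustrative choices:

mu = 400; sigma = 50; tau = 100             # illustrative ex-Gaussian parameters
M = 100000                                  # number of samples
eta = rexp(M, rate = 1/tau)                 # [eta]_m ~ Exponential(tau)
x = rnorm(M, mean = mu + eta, sd = sigma)   # [x]_m ~ Normal(mu + [eta]_m, sigma^2)
hist(x, breaks = 100, freq = FALSE)         # histogram approximates the ex-Gaussian density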

We now return to the problem of deriving the marginal posterior distribution of μ | w from the conditional μ | σ², w. If there were a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there were a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]1 ~ Normal(μ′([σ²]1), σ²′([σ²]1)),

where μ′ and σ²′ are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]m ~ Normal(μ′([σ²]m), σ²′([σ²]m)) (29)

and

[σ²]m ~ Inverse Gamma(a′, b′([μ]m−1)). (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994). The initial region in


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m0 denote the point at which the samples are approximately steady state. Samples after m0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σ_{m > m0} [μ]m / (M − m0)

and

σ̂² = Σ_{m > m0} [σ²]m / (M − m0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m0 to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution; Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.
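In R, for instance, this reciprocal relation gives a one-line sampler; here a and b are the parameters of the inverse gamma as defined in its density above (the function name is ours):

# n draws from an Inverse Gamma(a, b), via reciprocals of Gamma(shape = a, rate = b) draws
rinvgamma = function(n, a, b) 1 / rgamma(n, shape = a, rate = b)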

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ0 = 0, σ0² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value of [σ²]1. As can be seen, both chains converge to their steady state rather quickly. For this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.
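A minimal R sketch of this Gibbs sampler, using the conditionals in Equations 25 and 28 with the diffuse settings above (variable names are ours):

w = c(112, 106, 104, 111); N = length(w)
M = 10000
mu0 = 0; sigma0 = 1e6   # prior mean and variance of mu
a = 1e-6; b = 1e-6      # inverse-gamma prior parameters for sigma2
mu = 1:M; sigma2 = 1:M
sigma2[1] = 1           # arbitrary starting value
for (m in 2:M) {
  v = 1 / (N/sigma2[m-1] + 1/sigma0)                                # Equation 24
  mu[m] = rnorm(1, v * (sum(w)/sigma2[m-1] + mu0/sigma0), sqrt(v))  # Equations 23, 25
  sigma2[m] = 1 / rgamma(1, shape = N/2 + a,
                         rate = sum((w - mu[m])^2)/2 + b)           # Equations 26-28
}
mean(mu[101:M])   # posterior mean of mu after burn-in, about 108.25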

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.


Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]1 = 10,000 and [σ²]1 = .0001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ0 = 100 and σ0² = 15² = 225. We may not know the particular test–retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test–retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]1 = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another. The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly, and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that, for the normal application, there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution prior is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z), and conversely, z = Φ⁻¹(p).

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest.
When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ0, σ0²). (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case when a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).
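The flat-prior correspondence is easy to verify by simulation in R:

z = rnorm(100000)       # standard normal prior on z
p = pnorm(z)            # induced prior on p
hist(p, freq = FALSE)   # approximately uniform, i.e., Beta(1, 1)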

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) × exp[−(z − μ0)² / (2σ0²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined, upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]m = Φ([z]m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ0 = 0 and σ0² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with density a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = .5.


1,000 samples with 100 samples of burn-in are more than sufficient for this application.

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, pi. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂i = yi/Ni, where yi and Ni are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success. The ith

Figure 13. Three normally distributed priors on z [N(0,1), N(1,1), and N(0,2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


participant's true ability, zi, is simply a sample from this normal. The number of successes is modeled as

yi | zi ~ind Binomial[Ni, Φ(zi)], (32)

with a parent distribution (first-stage prior) given by

zi | μ, σ² ~ind Normal(μ, σ²). (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ0, σ0²) (34)

and

σ² ~ Inverse Gamma(a, b). (35)

Values of μ0, σ0², a, and b must be specified before analysis.
We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities,² which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂0 (Equation 12) and p̂2 (Equation 14) are shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ0 = 0, σ0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of pi for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of the other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂0. The top right panel shows the hierarchical estimates as a function of p̂0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂0. For the data of Table 2, the difference in efficiency between p̂2 and p̂0 is exceedingly small (about .0001). The reason that there was little gain for p̂2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂2 and p̂0 have about the same efficiency. The hierarchical estimator is supe-


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


rior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′i and ci denote signal detection parameters for the ith participant:

d′i = Φ⁻¹(pi^(h)) − Φ⁻¹(pi^(f))

and

ci = −Φ⁻¹(pi^(f)),

where pi^(h) and pi^(f) are individuals' true hit and false alarm probabilities, respectively. Let hi and fi be the probit transforms, hi = Φ⁻¹(pi^(h)) and fi = Φ⁻¹(pi^(f)). Then sensitivity and criteria are given by

d′i = hi − fi

and

ci = −fi.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of hi and fi results in accurate estimation of d′i and ci.

The model is developed as follows. Let Ni and Si denote the number of noise and signal trials given to the ith participant. Let yi^(h) and yi^(f) denote the number of hits and false alarms, respectively. The model for yi^(h) and yi^(f) is given by

yi^(h) | hi ~ind Binomial[Si, Φ(hi)]

and

yi^(f) | fi ~ind Binomial[Ni, Φ(fi)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

hi ~ind Normal(μh, σh²) (36)

and

fi ~ind Normal(μf, σf²). (37)

Second-stage priors are placed on parent distribution parameters (μk, σk²) as

μk ~ Normal(uk, vk),  k = h, f, (38)

and

σk² ~ Inverse Gamma(ak, bk),  k = h, f. (39)


Table 2
Data and Estimates of a Probability of Success

Participant   True Prob.   Data   p̂0    p̂2     Hierarchical
 1            .587         33     .66   .654   .672
 2            .621         28     .56   .558   .615
 3            .633         34     .68   .673   .683
 4            .650         29     .58   .577   .627
 5            .659         32     .64   .635   .660
 6            .665         37     .74   .731   .716
 7            .667         30     .60   .596   .638
 8            .667         29     .58   .577   .627
 9            .687         34     .68   .673   .684
10            .693         34     .68   .673   .682
11            .696         34     .68   .673   .683
12            .701         37     .74   .731   .715
13            .713         38     .76   .750   .728
14            .734         38     .76   .750   .727
15            .736         37     .74   .731   .717
16            .738         33     .66   .654   .672
17            .743         37     .74   .731   .716
18            .759         33     .66   .654   .672
19            .786         38     .76   .750   .728
20            .803         42     .84   .827   .771
RMSE                              .053  .053   .041

Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from hi and fi for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of hi) and for noise trials (producing estimates of fi). These posterior samples are denoted [hi] and [fi], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′i and ci. This is easily accomplished: For all samples m,

[d′i]m = [hi]m − [fi]m (40)

and

[ci]m = −[fi]m. (41)
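In R, for instance, if zh and zf are matrices of post-burn-in posterior samples of hi and fi (iterations in rows, participants in columns) from two runs of the Appendix C sampler, the chains and estimates follow directly (the variable names are ours):

dprime = zh - zf                  # [d'_i]_m = [h_i]_m - [f_i]_m (Equation 40)
crit   = -zf                      # [c_i]_m  = -[f_i]_m         (Equation 41)
dpest  = apply(dprime, 2, mean)   # posterior mean d' for each participant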

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(Hit rate) − Φ⁻¹(False alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: ah = af = bh = bf = .01, uh = uf = 0, and vh = vf = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂0, and the circles represent posterior means of pi from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and

items. Let yij^(h) and yij^(f) denote the response of the ith person to the jth item. Values of yij^(h) and yij^(f) are dichotomous, indicating whether the response was old or new. Let pij^(h) and pij^(f) be true probabilities that yij^(h) = 1 and yij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let hij and fij be the probit transforms [hij = Φ⁻¹(pij^(h)), fij = Φ⁻¹(pij^(f))]. The model is given by

yij^(h) ~ind Bernoulli[Φ(hij)] (42)

and

yij^(f) ~ind Bernoulli[Φ(fij)]. (43)

Linear models are placed on hij and fij:

hij = μh + αi + γj (44)

and

fij = μf + βi + δj. (45)

Parameters μh and μf correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μh − μf (46)

and

c = −μf. (47)

y fij ij( ) f ind

~ Bernoulli Φ( )⎡⎣ ⎤⎦

y hij ij( )h ind

~ Bernoulli Φ( )⎡⎣ ⎤⎦
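To make the generative side of Equations 42-45 concrete, here is a minimal R sketch that simulates one data set from the model. The grand means and random-effect variances below are illustrative assumptions, not the values used in the reported simulations.

# Sketch: simulate one data set from Equations 42-45.
# Grand means and variances here are illustrative assumptions.
I=20; J=50                          # participants, items
mu.h=1; mu.f=-1                     # grand means (so d' = 2)
alpha=rnorm(I,0,sqrt(.2))           # participant effects on hits
beta =rnorm(I,0,sqrt(.2))           # participant effects on false alarms
gamma=rnorm(J,0,sqrt(.2))           # item effects on hits
delta=rnorm(J,0,sqrt(.2))           # item effects on false alarms
h=outer(alpha,gamma,"+")+mu.h       # I x J probit hit rates (Equation 44)
f=outer(beta,delta,"+")+mu.f        # I x J probit false alarm rates (Equation 45)
y.h=matrix(rbinom(I*J,1,pnorm(h)),I,J)  # Equation 42
y.f=matrix(rbinom(I*J,1,pnorm(f)),I,J)  # Equation 43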

Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146          .880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275

Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i (48)

and

c_i = −(μ_f + β_i). (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j (50)

and

c_j = −(μ_f + δ_j). (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ²_0k, where k indexes hits or false alarms. In the implementation, we set μ_0k = 0 and σ²_0k = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ Normal(0, σ²_α), independently, (52)

β_i ~ Normal(0, σ²_β), independently, (53)

γ_j ~ Normal(0, σ²_γ), independently, (54)

and

δ_j ~ Normal(0, σ²_δ), independently. (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8-11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in the model analysis; participant and item effects on hits and false alarms are estimated independently.

Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.


Figure 18 shows plots of the samples of overall sensi-tivity (d prime ) The chain has a much larger degree of auto-correlation than was previously observed and this auto-correlation greatly complicates analysis If a short chainis used the resulting sample will not reflect the true pos-terior Two strategies can mitigate autocorrelation Thefirst is to run long chains and thin them For this appli-cation the Gibbs sampling is sufficiently quick that thechain can be run for millions of cycles in the course of aday With exceptionally long chains convergence oc-curs In such cases researchers do not use all iterationsbut instead skip a fixed numbermdashfor example discard-ing all but every 50th observation The correlation be-tween sample m and sample m 50 is greatly attenuatedThe choice of the multiple 50 in the example above is notcompletely arbitrary because it should reflect the amountof autocorrelation Chains with larger amounts of autocor-relation should implicate larger multiples for thinning
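As a concrete illustration of thinning, the following R sketch discards a burn-in period and then keeps every 50th iteration; here, chain is a hypothetical vector of posterior samples, not an object defined elsewhere in this article.

# Sketch: discard burn-in, then keep every 50th iteration.
# 'chain' is a hypothetical vector of M posterior samples.
M=length(chain)
burnin=100
kept=chain[seq(burnin+1,M,by=50)]  # thinned chain
acf(kept)                          # autocorrelation should be greatly attenuated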

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by the conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_(m−1) influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_(m−1) influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of the conditional posterior distribution follows by multiplying the likelihood by the prior:

f(θ_h | σ²_α, σ²_γ, y) ∝ f(y | θ_h) × f(θ_h | σ²_α, σ²_γ).

Each component of θ_h has a normal prior. Hence, the prior f(θ_h | σ²_α, σ²_γ) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ²_α, σ²_γ). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for the new-item parameters (μ_f, β, δ, σ²_β, σ²_δ) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
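For readers implementing a sampler from scratch, here is a minimal R sketch of the Cholesky approach: a multivariate normal draw is assembled from independent univariate normal draws. The function name is ours, for illustration only, and is not part of any package.

# Sketch: n draws from Normal(mu, Sigma) via Cholesky decomposition.
rmvnorm.chol=function(n,mu,Sigma) {
  R=chol(Sigma)                      # upper triangular; t(R) %*% R = Sigma
  z=matrix(rnorm(n*length(mu)),n)    # independent standard normals
  sweep(z%*%R,2,mu,"+")              # rotate/scale, then shift by the mean
}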

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8-11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias; they are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect that over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.


The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant: the y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected; participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.



These results are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.



FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that this comes about because the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ Bivariate Normal(0, Σ^(i)), independently,

and

(γ_j, δ_j)′ ~ Bivariate Normal(0, Σ^(j)), independently.

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model, and provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
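To give a sense of what a Wishart prior involves, the following R sketch draws a single 2 × 2 matrix from a Wishart distribution using base R; the degrees of freedom and scale matrix below are illustrative assumptions, not values from any fitted model.

# Sketch: one random 2 x 2 matrix draw from a Wishart prior.
# Degrees of freedom and scale matrix are illustrative assumptions.
S=rWishart(1,df=3,Sigma=diag(2))[,,1]
cov2cor(S)  # implied correlation between hit and false alarm effects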

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of the old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model for measuring sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit: they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing it is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of the hit rate. Within the context of such a model, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.
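One way such an extension might be written, purely as a sketch under our own assumption that the old-item standard deviation scales the probit transform of the hit rate (σ_o below is a hypothetical free parameter, not a quantity defined elsewhere in this article), is

y_ij^(h) ~ Bernoulli[Φ(h_ij/σ_o)], with h_ij = μ_h + α_i + γ_j,

so that σ_o = 1 recovers the equal-variance model of Equations 42 and 44.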

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to the analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is from the normal distribution: the posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on the variance is 1/σ².


Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis-Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of the data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times; the impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform the analysis repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent; the value of an estimate for any participant depends, in part, on the values of the others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.


Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done, unwittingly, from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when it happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.
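Convergence checks of the sort just recommended are easy to automate. A sketch, assuming the add-on coda package and a hypothetical vector chain of posterior samples:

# Sketch: common convergence checks, assuming the 'coda' package
# and a hypothetical vector 'chain' of posterior samples.
library(coda)
plot(mcmc(chain))           # trace plot and marginal density
autocorr.plot(mcmc(chain))  # autocorrelation function
effectiveSize(mcmc(chain))  # effective number of independent samples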

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider those advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0; x_j = 0 if w_j < 0. (A1)

Latent variables w are never known; all that is known is their sign. If trial j was successful, the latent variable was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ^(-1)(p) (see Note A1). Therefore,

w_j ~ Normal(z, 1), independently.

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, …, w_J); the data are (x_1, …, x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ²_0, so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ²_0)^(-1) (N w̄ + μ_0/σ²_0) (A2)

and

σ′² = (N + 1/σ²_0)^(-1). (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative; hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.


Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.

NOTES

1. The beta function is given by

Be(a, b) = Γ(a)Γ(b)/Γ(a + b),

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0; x_ij = 0 if w_ij < 0. (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ²_0). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.



Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with a variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ^(-1)(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b,1) truncated below zero
rtnpos=function(n,b) {
  u=runif(n)
  b+qnorm(1-u*pnorm(b))
}
# generates n samples of a normal(b,1) truncated above zero
rtnneg=function(n,b) {
  u=runif(n)
  b+qnorm(u*pnorm(-b))
}

# -----------------------------
# generate true values and data
# -----------------------------
I=20  # Number of participants
J=50  # Number of Trials per participant
# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# Analysis Set Up
# ---------------
M=10000        # total number of MCMC iterations
myrange=101:M  # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z=matrix(ncol=I,nrow=M)  # matrix of each person's z at each iteration
                         # note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
# prior parameters
sigma0=100  # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a

# --------------
# Main MCMC loop
# --------------
for (m in 2:M) {    # iteration
  for (i in 1:I) {  # participant
    w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i]))  # samples of w|y,z
    prec=J+1/sigma2[m-1]
    meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
    z[m,i]=rnorm(1,meanz,sqrt(1/prec))  # sample of z|w,mu,sigma2
  }
  prec=I/sigma2[m-1]+1/sigma0
  meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
  mu[m]=rnorm(1,meanmu,sqrt(1/prec))  # sample of mu|z
  rate=sum((z[m,]-mu[m])^2)/2+b
  sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate)  # sample of sigma2|z
}

# -----------------------------------------
# Gather estimates of p for each individual
# -----------------------------------------
allpest=pnorm(z[myrange,])   # MCMC chains of p
pesth=apply(allpest,2,mean)  # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



Bayesian analysis more tractable: Markov chain Monte Carlo sampling. In the first section, basic issues of estimation are discussed. Then this discussion is expanded to include the basics of Bayesian estimation. Following that, Markov chain Monte Carlo integration is introduced. Finally, a series of hierarchical signal detection models is presented. These models provide unbiased and relatively efficient sensitivity estimates for the standard equal-variance signal detection model, even in the presence of both participant and item variability.

ESTIMATION

It is easier to discuss Bayesian estimation in the context of a simple example. Consider the case in which a single participant performs N trials. The result of each trial is either a success or a failure. We wish to estimate the underlying true probability of a success, denoted p, for this single participant. Estimating a probability of success is essential to many applications in cognitive psychology, including signal detection analysis. In signal detection, we typically estimate hit and false alarm rates: hits are successes on old-item trials, and false alarms are failures on new-item trials.

A formula that provides an estimate is termed an estimator. A reasonable estimator for this case is

p̂_0 = y/N, (12)

where y is the number of successes out of N trials. Estimator p̂_0 is natural and straightforward.

To evaluate the usefulness of estimators, statisticians usually discuss three basic properties: bias, efficiency, and consistency. Bias and efficiency are illustrated in Table 1. The data are the results of weighing a hypothetical person of 170 lb on two hypothetical scales four separate times. Bias refers to the mean of repeated estimates. Scale A is unbiased because the mean of the estimates equals the true value of 170 lb. Scale B is biased: the mean is 172 lb, which is 2 lb greater than the true value of 170 lb. Scale B, however, has a smaller degree of error than does Scale A, so Scale B is termed more efficient than Scale A. Efficiency is the inverse of the expected error of an observation and may be indexed by the reciprocal of root mean squared error. Bias and efficiency have the same meaning for estimators as they do for scales: bias refers to the difference between the average value of an estimator and a true value; efficiency refers to the average magnitude of the difference between an estimator and a true value. In many situations, efficiency determines the quality of inference more than bias does.

How biased and efficient is estimator p̂_0? To provide a context for evaluation, consider the following two alternative estimators:

p̂_1 = (y + 1/2)/(N + 1) (13)

and

p̂_2 = (y + 1)/(N + 2). (14)

These two alternatives may seem unprincipled, but, as is discussed in the next section, they are justified in a Bayesian framework.

Figure 4 shows sampling distributions for the three probability estimators when the true probability is p = .7 and there are N = 10 trials. Estimator p̂_0 is not biased, but estimators p̂_1 and p̂_2 are. Surprisingly, estimator p̂_2 has the lowest average error; that is, it is the most efficient. The reason is that the tails of the sampling distribution are closer to the true value of p = .7 for p̂_2 than for p̂_1 or p̂_0. Figure 4 shows the case for a single true value of p = .7; Figure 5 shows bias and efficiency for all three estimators across the full range of p. The conventional estimator p̂_0 is unbiased for all true values of p, but the other two estimators are biased for extreme probabilities. None of the estimators is always more efficient than the others: for intermediate probabilities, estimator p̂_2 is most efficient; for extreme probabilities, estimator p̂_0 is most efficient. Typically, researchers have some idea of what probability of success to expect in their experiments. This knowledge can therefore be used to help pick the best estimator for a particular situation.
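The bias and efficiency values plotted in Figures 4 and 5 are easy to approximate by simulation. The following R sketch does so for p = .7 and N = 10; the number of replicates is an arbitrary choice of ours.

# Sketch: approximate bias and RMSE of the three estimators (p = .7, N = 10).
N=10; p=.7
y=rbinom(100000,N,p)  # simulated success counts
est=list(p0=y/N,
         p1=(y+.5)/(N+1),
         p2=(y+1)/(N+2))
sapply(est,function(e) c(bias=mean(e)-p,
                         RMSE=sqrt(mean((e-p)^2))))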

The other property of estimators is consistency. A consistent estimator converges to its true value: as the sample size is increased, the estimator not only becomes unbiased, but its overall variance shrinks toward 0. Consistency should be viewed as a necessary property of a good estimator. Fortunately, many estimators used in psychology are consistent, including sample means, sample variances, and sample correlation coefficients. The three estimators p̂_0, p̂_1, and p̂_2 are consistent: with sufficient data, they converge to the true value of p. In fact, all estimators presented in this article are consistent.

BAYESIAN ESTIMATION OF A PROBABILITY

In this section, we provide a Bayesian approach to estimating the probability of success. The datum in this application is the number of successes (y), and this quantity may be modeled as a binomial:

y ~ Binomial(N, p),

where N is known and p is the unknown parameter of interest.


Table 1
Weight of a 170-Pound Person Measured on Two Hypothetical Scales

           Scale A               Scale B
Data (lb)  180, 160, 175, 165    174, 170, 173, 171
Mean       170                   172
Bias       0                     2.0
RMSE       7.91                  2.55


Bayes's (1763) insightful contribution was to use the law of conditional probability to estimate the parameter p. He noted that

f(p | y) = f(y | p) f(p) / f(y). (15)

We describe each of these terms in order.
The main goal is to find f(p | y), the quantity on the left-hand side of Equation 15. The term f(p | y) is the distribution of the parameter p given the data y. This distribution is referred to as the posterior distribution; it is viewed as a function of the parameter p. The mean of the posterior distribution, termed the posterior mean, serves as a suitable point estimator for p.

The term f(y | p) plays two roles in statistics, depending on context. When it is viewed as a function of y for known p, it is known as the probability mass function. The probability mass function describes the probability of any outcome y given a particular value of p. For a binomial random variable, the probability mass function is given by

f(y | p) = C(N, y) p^y (1 − p)^(N−y), y = 0, …, N; 0 otherwise, (16)

where C(N, y) denotes the binomial coefficient. For example, this function can be used to compute the probability of observing a total of two successes on three trials when p = .5:

f(y = 2 | p = .5) = C(3, 2) (.5)² (1 − .5)¹ = 3/8.

The term f(y | p) can also be viewed as a function of parameter p for fixed y. In this case, it is termed the likelihood function and describes the likelihood of a particular parameter value p given a fixed value of y. For the binomial, the likelihood function is given by

f(y | p) = C(N, y) p^y (1 − p)^(N−y), 0 ≤ p ≤ 1. (17)

In Bayes's theorem, we are interested in the posterior as a function of the parameter p. Hence, the right-hand terms of Equation 15 are viewed as a function of parameter p. Therefore, f(y | p) is viewed as the likelihood of p, rather than the probability mass of y.
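As a quick check of the example above, the binomial probability mass function of Equation 16 is available in R as dbinom():

# The binomial probability mass of Equation 16, evaluated in R:
dbinom(2,3,.5)  # choose(3,2) * .5^2 * .5^1 = 3/8 = .375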

The term f(p) is the prior distribution and reflects the experimenter's a priori beliefs about the true value of p.
Finally, the term f(y) is the distribution of the data given the model. Although its interpretation is important in some contexts, it plays a minimal role in the development presented here, for the following reason.


Figure 4. Sampling distributions of p̂_0, p̂_1, and p̂_2 for N = 10 trials with a p = .7 true probability of success on any one trial. Bias and root mean squared error (RMSE) are included.


The posterior is a function of the parameter of interest, p, but f(y) does not depend on p. The term f(y) is a constant with respect to p and serves as a normalizing constant that ensures that the posterior density integrates to 1. As will be shown in the examples, the value of this normalizing constant usually becomes apparent in analysis.

To perform Bayesian analysis, the researcher must choose a prior distribution. For this application of estimating a probability from binomially distributed data, the beta distribution serves as a suitable prior distribution for p. The beta is a flexible two-parameter distribution; see Figure 6 for examples. In contrast to the normal distribution, which has nonzero density on all numbers, the beta has nonzero density only between 0 and 1. The beta distribution is a function of two parameters, a and b, that determine its shape. The notation for specifying a beta distribution prior for p is

p ~ Beta(a, b).

The density of a beta random variable is given by

f(p) = p^(a−1) (1 − p)^(b−1) / Be(a, b). (18)

The denominator, Be(a, b), is termed the beta function.¹ Like f(y), it is not a function of p and plays the role of a normalizing constant in analysis.

In practice, researchers must completely specify the prior distribution before analysis; that is, they must choose suitable values for a and b. This choice reflects researchers' beliefs about the possible values of p before data are collected. When a = 1 and b = 1 (see the middle panel of Figure 6), the beta distribution is flat; that is, there is equal density for all values in the interval [0, 1]. By choosing a = b = 1, the researcher is committing to all values of p being equally likely before data are collected. Researchers can choose values of a and b to best match their beliefs about the expected data. For example, consider a psychophysical experiment in which the value of p will almost surely be greater than .5. For this experiment, priors with a > b are appropriate.

The goal is to derive the posterior distribution using Bayes's theorem (Equation 15). Substituting the likelihood function (Equation 17) and prior (Equation 18) into Bayes's theorem yields

f(p | y) = [C(N, y) p^y (1 − p)^(N−y)] × [p^(a−1) (1 − p)^(b−1)] / [Be(a, b) f(y)].

Collecting terms in p yields

f(p | y) = k p^(y+a−1) (1 − p)^(N−y+b−1), (19)

where

k = C(N, y) [Be(a, b) f(y)]^(−1).

The term k is constant with respect to p and serves as a normalizing constant. Substituting

a′ = y + a, b′ = N − y + b (20)

into Equation 19 yields

f(p | y) = k p^(a′−1) (1 − p)^(b′−1).

This equation is proportional to a beta density with parameters a′ and b′. To make f(p | y) a proper probability density, k must have a value such that f(p | y) integrates to 1.


Figure 5. Bias and root mean squared error (RMSE) for the three estimators as functions of the true probability of success. The solid, dashed, and dashed-dotted lines denote the characteristics of p̂_0, p̂_1, and p̂_2, respectively.

Figure 6. Probability density function of the beta distribution for various values of parameters a and b.


This occurs only if the value of k is a beta function, that is, k = 1/Be(a′, b′). Consequently,

f(p | y) = p^(a′−1) (1 − p)^(b′−1) / Be(a′, b′). (21)

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by sepa-rating those terms that depend on the parameter of inter-est from those that do not The latter terms serve as anormalizing constant and can be computed by ensuringthat the posterior integrates to 10 In most derivationsthe value of the normalizing constant becomes apparentmuch as it did above Consequently it is convenient toconsider only those terms that depend on the parameterof interest and to lump the rest into a proportionality con-stant Hence a convenient form of Bayesrsquos theorem is

Posterior distribution Likelihood function Prior distribution

The symbol "∝" denotes proportionality and is read "is proportional to." This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1, and the observed data are 7 successes in 10 trials. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably more narrow than the prior, indicating that a large degree of information has been gained from the data. There are two valuable quantities derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.
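This update is simple to carry out in R, the language of the article's posted code. The sketch below reproduces the Figure 7 example; note that qbeta() returns the central (equal-tail) 95% credible interval, which closely approximates the highest density region for a posterior this regular.

    # Beta posterior for y = 7 successes in N = 10 trials under a flat Beta(1, 1) prior
    y <- 7; N <- 10
    a <- 1; b <- 1
    a.prime <- a + y                          # Equation 20: a' = 8
    b.prime <- b + (N - y)                    # Equation 20: b' = 4
    a.prime / (a.prime + b.prime)             # posterior mean, about .67
    qbeta(c(.025, .975), a.prime, b.prime)    # central 95% credible interval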

For the case of a beta-distributed posterior, the expression for the posterior mean is

p̂ = a′/(a′ + b′) = (y + a)/(N + a + b).

The posterior mean reduces to p1 if a = b = .5; it reduces to p2 if a = b = 1. Therefore, estimators p1 and p2 are theoretically justified from a Bayesian perspective. The estimator p0 may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper. Conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper. In some cases, an improper prior will still yield a proper posterior. When this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.

When the prior and posterior are from the same distribution, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data because the resulting posterior is also a beta distribution. Conjugate priors are convenient; they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion about conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model. We will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations w = (w_1, …, w_N) are independent and identically distributed as normals:

w_j ~ Normal(μ, σ²).

The goal is to estimate parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²

Assume that σ² is known and that the estimation of μ is of primary interest. The goal, then, is to derive the posterior f(μ | σ², w).


Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


According to the proportional form of Bayes's theorem,

f(μ | σ², w) ∝ f(w | μ, σ²) f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

μ ~ Normal(μ_0, σ_0²).

Parameters μ_0 and σ_0² must be specified before analysis according to the researcher's beliefs about μ. By choosing σ_0² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ. The density of the prior is given by

f(μ) = (2πσ_0²)^(−1/2) exp[−(μ − μ_0)² / (2σ_0²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

f(w | μ, σ²) = (2πσ²)^(−N/2) exp[−Σ_j (w_j − μ)² / (2σ²)].    (22)

Multiplying the equation above by the prior yields

f(μ | σ², w) ∝ exp[−Σ_j (w_j − μ)² / (2σ²)] × exp[−(μ − μ_0)² / (2σ_0²)].

We concern ourselves with only those terms that involve μ, the parameter of interest. Other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

f(μ | σ², w) ∝ exp[−(μ − μ′)² / (2σ′²)],

where

μ′ = σ′² (μ_0/σ_0² + Σ_j w_j / σ²)    (23)

and

σ′² = (1/σ_0² + N/σ²)^(−1).    (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ′², respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ′²:

μ | σ², w ~ Normal(μ′, σ′²).    (25)

Unfortunately, the notation does not make it clear that both μ′ and σ′² depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ′²(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ′² = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25.



For demonstration purposes, the prior is constructed to be fairly informative, with μ_0 = 100 and σ_0² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean, but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.
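The posterior parameters in this example follow directly from Equations 23 and 24; a minimal R sketch using the values above:

    # Conjugate update for mu with known variance (Equations 23-24)
    w <- c(112, 106, 104, 111)
    sigma2 <- 16                   # known variance
    mu0 <- 100; sigma02 <- 100     # informative prior on mu
    sigma2.prime <- 1 / (1/sigma02 + length(w)/sigma2)         # Equation 24: about 3.85
    mu.prime <- sigma2.prime * (mu0/sigma02 + sum(w)/sigma2)   # Equation 23: about 107.9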

Estimating σ² With Known μ

In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

f(σ² | a, b) = [b^a / Γ(a)] (σ²)^(−(a+1)) exp(−b/σ²),  σ² > 0.

The parameters of the inverse gamma are a and b, and in application these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

f(σ² | μ, w) ∝ (2πσ²)^(−N/2) exp[−Σ_j (w_j − μ)² / (2σ²)] × [b^a / Γ(a)] (σ²)^(−(a+1)) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

f(σ² | μ, w) ∝ (σ²)^(−(a′+1)) exp(−b′/σ²),

where

a′ = N/2 + a    (26)

and

b′ = Σ_j (w_j − μ)² / 2 + b.    (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

σ² | μ, w ~ Inverse Gamma(a′, b′).    (28)

Note that posterior parameter b′ depends explicitly on the value of μ.
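In R, the posterior parameters are two lines; here w, mu, a, and b are assumed to be already defined:

    # Posterior parameters for sigma^2 given mu (Equations 26-27)
    a.prime <- length(w)/2 + a
    b.prime <- sum((w - mu)^2)/2 + b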

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem. The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors, μ | w and σ² | w.


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


The straightforward method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis–Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis. With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]_1 denotes the first sample, [η]_2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]_1 ~ Normal(μ + [η]_1, σ²)

and

[x]_2 ~ Normal(μ + [η]_2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
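The demonstration takes only a few lines of R; the parameter values below are arbitrary choices for illustration, not necessarily those used for Figure 9:

    # Ex-Gaussian samples via Monte Carlo integration of the conditional form
    M <- 100000
    mu <- 400; sigma <- 50; tau <- 100           # illustrative parameter values
    eta <- rexp(M, rate = 1/tau)                 # [eta]_m ~ Exponential(tau), mean tau
    x <- rnorm(M, mean = mu + eta, sd = sigma)   # [x]_m ~ Normal(mu + [eta]_m, sigma^2)
    hist(x, breaks = 100, freq = FALSE)          # approximates the ex-Gaussian density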

We now return to the problem of deriving the marginal posterior distribution of μ | w from the conditional μ | σ², w. If there was a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there was a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]_1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]_1 ~ Normal(μ′([σ²]_1), σ′²([σ²]_1)),

where μ′ and σ′² are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]_1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]_m ~ Normal(μ′([σ²]_m), σ′²([σ²]_m))    (29)

and

[σ²]_m ~ Inverse Gamma(a′, b′([μ]_{m−1})).    (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.
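Equations 29 and 30 translate almost line for line into R. The sketch below uses the IQ data and diffuse prior settings of the example discussed below; the starting value [σ²]_1 = 1 is arbitrary.

    # Gibbs sampler for mu and sigma^2, both unknown (Equations 29-30)
    w <- c(112, 106, 104, 111)
    N <- length(w); M <- 10000
    mu0 <- 0; sigma02 <- 1e6                      # diffuse normal prior on mu
    a <- 1e-6; b <- 1e-6                          # diffuse inverse-gamma prior on sigma^2
    mu <- numeric(M); sigma2 <- numeric(M)
    sigma2[1] <- 1                                # arbitrary starting value
    s2p <- 1 / (1/sigma02 + N/sigma2[1])          # Equation 24
    mu[1] <- rnorm(1, s2p * (mu0/sigma02 + sum(w)/sigma2[1]), sqrt(s2p))
    for (m in 2:M) {
      sigma2[m] <- 1 / rgamma(1, shape = N/2 + a,
                              rate = sum((w - mu[m-1])^2)/2 + b)       # Equation 30
      s2p <- 1 / (1/sigma02 + N/sigma2[m])                             # Equation 24
      mu[m] <- rnorm(1, s2p * (mu0/sigma02 + sum(w)/sigma2[m]),
                     sqrt(s2p))                                        # Equation 29
    }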

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]_1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994).


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


The initial region in which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m_0 denote the point at which the samples are approximately steady state. Samples after m_0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σ_{m > m_0} [μ]_m / (M − m_0)

and

σ̂² = Σ_{m > m_0} [σ²]_m / (M − m_0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m_0 to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution. Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.
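In R, with a′ and b′ computed as in Equations 26 and 27, the reciprocal trick is a single line:

    # Inverse-gamma samples as reciprocals of gamma samples
    sigma2.samples <- 1 / rgamma(10000, shape = a.prime, rate = b.prime)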

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant, w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ_0 = 0, σ_0² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value for [σ²]_1. As can be seen, both chains converge to their steady state rather quickly. For this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.



Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]_1 = 10,000 and [σ²]_1 = .0001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ_0 = 100 and σ_0² = 15² = 225. We may not know the particular test–retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test–retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]_1 = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior; people, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another. The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that for the normal application there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
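In R, this informal check is a single call on the post-burn-in portion of a chain; mu here is the chain from the Gibbs sampler sketched earlier:

    # Autocorrelation function; spikes near zero lag only indicate little dependence
    keep <- 101:10000    # discard a conservative burn-in
    acf(mu[keep])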

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z), and conversely, z = Φ⁻¹(p).
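In R, Φ and Φ⁻¹ are the built-in functions pnorm() and qnorm():

    pnorm(0.5)    # Phi(0.5), about .69
    qnorm(0.69)   # Phi^{-1}(.69), about 0.5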

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ_0, σ_0²).    (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case in which a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).
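This correspondence is easy to verify by simulation, which is exactly how the priors on p in Figure 13 were obtained:

    # Induced prior on p from a standard normal prior on z
    z <- rnorm(100000, mean = 0, sd = 1)
    p <- pnorm(z)
    hist(p, freq = FALSE)    # approximately flat on [0, 1], i.e., Beta(1, 1)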

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) exp[−(z − μ_0)² / (2σ_0²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ_0 = 0 and σ_0² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with parameters a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]_1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period.


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = .5.


Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i / N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success.


Figure 13. Three normally distributed priors on z (panels, left to right: N(0, 1), N(1, 1), and N(0, 2)) and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The ith participant's true ability, z_i, is simply a sample from this normal. The number of successes is modeled as

y_i | z_i ~ Binomial[N_i, Φ(z_i)], independently across i,    (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ Normal(μ, σ²), independently across i.    (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ_0, σ_0²)    (34)

and

σ² ~ Inverse Gamma(a, b).    (35)

Values of μ_0, σ_0², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities,² which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p0 (Equation 12) and p2 (Equation 14) are shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ_0 = 0, σ_0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p0. The top right panel shows the hierarchical estimates as a function of p0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p0. For the data of Table 2, the difference in efficiency between p2 and p0 is exceedingly small (about .0001). The reason that there was little gain for p2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p2 and p0 have about the same efficiency.


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value, p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms, h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ Binomial[N_i, Φ(f_i)], independently across i.

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ Normal(μ_h, σ_h²), independently across i,    (36)

and

f_i ~ Normal(μ_f, σ_f²), independently across i.    (37)

Second-stage priors are placed on parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k), k = h, f,    (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k), k = h, f.    (39)


Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p0    p2    Hierarchical
 1           .587        33    .66   .654  .672
 2           .621        28    .56   .558  .615
 3           .633        34    .68   .673  .683
 4           .650        29    .58   .577  .627
 5           .659        32    .64   .635  .660
 6           .665        37    .74   .731  .716
 7           .667        30    .60   .596  .638
 8           .667        29    .58   .577  .627
 9           .687        34    .68   .673  .684
10           .693        34    .68   .673  .682
11           .696        34    .68   .673  .683
12           .701        37    .74   .731  .715
13           .713        38    .76   .750  .728
14           .734        38    .76   .750  .727
15           .736        37    .74   .731  .717
16           .738        33    .66   .654  .672
17           .743        37    .74   .731  .716
18           .759        33    .66   .654  .672
19           .786        38    .76   .750  .728
20           .803        42    .84   .827  .771

RMSE                           .053  .053  .041

Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished. For all samples,

[d′_i]_m = [h_i]_m − [f_i]_m    (40)

and

[c_i]_m = −[f_i]_m.    (41)
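In R, these transformations are vectorized one-liners; h.chain and f.chain stand for a participant's posterior sample vectors [h_i] and [f_i] (hypothetical names):

    # Equations 40-41: sensitivity and criterion chains from probit chains
    dprime.chain <- h.chain - f.chain
    c.chain <- -f.chain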

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(Hit rate) − Φ⁻¹(False alarm rate).
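Applied in R to vectors of hit and false alarm counts (hits and fas below are hypothetical names), with 50 signal and 50 noise trials per participant:

    # Conventional per-participant sensitivity estimates
    S <- 50; N <- 50
    dprime.conv <- qnorm(hits/S) - qnorm(fas/N)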

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = .01, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17.


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme, blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension

It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was old or new. Let p_ij^(h) and p_ij^(f) be true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) | h_ij ~ Bernoulli[Φ(h_ij)]    (42)

and

y_ij^(f) | f_ij ~ Bernoulli[Φ(f_ij)], independently.    (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j    (44)

and

f_ij = μ_f + β_i + δ_j.    (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f    (46)

and

c = −μ_f.    (47)


Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146          .880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580

RMSE                                      .183          .275

Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i    (48)

and

c_i = −(μ_f + β_i).    (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j    (50)

and

c_j = −(μ_f + δ_j).    (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ Normal(0, σ_α²),    (52)

β_i ~ Normal(0, σ_β²),    (53)

γ_j ~ Normal(0, σ_γ²),    (54)

and

δ_j ~ Normal(0, σ_δ²), independently.    (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities.


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.


Although the data were generated with a mirror effect, this effect is not assumed in model analysis; participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations, but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
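Thinning is a one-line subscripting operation in R; dprime.chain stands for the raw chain of sensitivity samples (a hypothetical name):

    # Keep every 50th sample to attenuate autocorrelation
    thinned <- dprime.chain[seq(from = 1, to = length(dprime.chain), by = 50)]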

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling

Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sum of Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the values of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation is also true of μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior. Hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18). Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of 27% in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased.


All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant. The y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation.


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21. These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect.


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.

HIERARCHICAL BAYESIAN MODELS 597

baseline-familiarity effect In sum the hierarchical modeloffers detailed and relatively accurate analysis of partici-pant and item effects in signal detection

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors

The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ Bivariate Normal(0, Σ^(i)), independently across i,

and

(γ_j, δ_j)′ ~ Bivariate Normal(0, Σ^(j)), independently across j.

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal-Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.
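As a hedged illustration of this direction (not the model developed in this article), the R lines below add a free old-item standard deviation sigma_old to the usual probit relations; the parameterization and the numerical values are our assumptions, and sigma_old = 1 recovers the equal-variance case.

# Unequal-variance signal detection: hit and false alarm rates
hit_rate <- function(dprime, crit, sigma_old) pnorm((dprime - crit) / sigma_old)
fa_rate  <- function(crit) pnorm(-crit)
hit_rate(dprime = 1.5, crit = 0.5, sigma_old = 1.25)  # hit rate with extra old-item variance
fa_rate(crit = 0.5)                                   # false alarm rate is unaffected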

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is


from the normal distribution. The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior. First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior. Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that in some applications researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information. This is information that people are not arbitrarily dissimilar; in other words, true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent. The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior. When this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.

APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0;  x_j = 0 if w_j < 0.   (A1)

Latent variables w are never known. All that is known is their sign: If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs be z, where z = Φ⁻¹(p).^A1 Therefore,

w_j ~ Normal(z, 1), independently across trials.

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, …, w_J); the data are (x_1, …, x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (J + 1/σ_0²)⁻¹ (Σ_j w_j + μ_0/σ_0²)   (A2)

and

σ²′ = (J + 1/σ_0²)⁻¹.   (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.


NOTES

1. The beta function is given by

Be(a, b) = Γ(a)Γ(b) / Γ(a + b),

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.


APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0;  x_ij = 0 if w_ij < 0.   (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.

APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j. In this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j. In this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a Normal(b,1) truncated below zero
rtnpos = function(n, b) {u = runif(n); b + qnorm(1 - u * pnorm(b))}
# generates n samples of a Normal(b,1) truncated above zero
rtnneg = function(n, b) {u = runif(n); b + qnorm(u * pnorm(-b))}

# -----------------------------
# generate true values and data
# -----------------------------
I = 20   # number of participants
J = 50   # number of trials per participant
# generate each individual's true p
p = rbeta(I, 3.5, 1.5)
# result (use scan() to enter):
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result (use scan() to enter):
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# analysis setup
# ---------------
M = 10000         # total number of MCMC iterations
myrange = 101:M   # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z = matrix(ncol = I, nrow = M)   # matrix of each person's z at each iteration
# note: z is the probit of p, e.g., z = qnorm(p)
sigma2 = 1:M
mu = 1:M
# prior parameters
sigma0 = 100   # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1
# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)
# shape for inverse gamma, calculated outside of loop for speed
shape = I/2 + a

# --------------
# main MCMC loop
# --------------
for (m in 2:M) {                 # iteration
  for (i in 1:I) {               # participant
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))  # samples of w | y, z
    prec = J + 1/sigma2[m-1]
    meanz = (sum(w) + mu[m-1]/sigma2[m-1]) / prec
    z[m, i] = rnorm(1, meanz, sqrt(1/prec))    # sample of z | w, mu, sigma2
  }
  prec = I/sigma2[m-1] + 1/sigma0
  meanmu = (sum(z[m, ])/sigma2[m-1] + mu0/sigma0) / prec
  mu[m] = rnorm(1, meanmu, sqrt(1/prec))       # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2)/2 + b
  sigma2[m] = 1/rgamma(1, shape = shape, scale = 1/rate)  # sample of sigma2 | z
}

# -----------------------------------------
# gather estimates of p for each individual
# -----------------------------------------
allpest = pnorm(z[myrange, ])     # MCMC chains of p
pesth = apply(allpest, 2, mean)   # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)


Bayes's (1763) insightful contribution was to use the law of conditional probability to estimate the parameter p. He noted that

f(p | y) = f(y | p) f(p) / f(y).   (15)

We describe each of these terms in order.

The main goal is to find f(p | y), the quantity on the left-hand side of Equation 15. The term f(p | y) is the distribution of the parameter p given the data y. This distribution is referred to as the posterior distribution. It is viewed as a function of the parameter p. The mean of the posterior distribution, termed the posterior mean, serves as a suitable point estimator for p.

The term f(y | p) plays two roles in statistics, depending on context. When it is viewed as a function of y for known p, it is known as the probability mass function. The probability mass function describes the probability of any outcome y given a particular value of p. For a binomial random variable, the probability mass function is given by

f(y | p) = (N choose y) p^y (1 − p)^(N−y),  y = 0, …, N;  0 otherwise.   (16)

For example, this function can be used to compute the probability of observing a total of two successes on three trials when p = .5:

f(y = 2 | p = .5) = (3 choose 2) (.5)² (1 − .5)¹ = 3/8.
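This worked example can be checked directly in R with dbinom, R's binomial probability mass function (the check is our illustration):

dbinom(2, size = 3, prob = 0.5)   # returns 0.375, i.e., 3/8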

The term f(y | p) can also be viewed as a function of parameter p for fixed y. In this case, it is termed the likelihood function and describes the likelihood of a particular parameter value p given a fixed value of y. For the binomial, the likelihood function is given by

f(y | p) = (N choose y) p^y (1 − p)^(N−y),  0 ≤ p ≤ 1.   (17)

In Bayes's theorem, we are interested in the posterior as a function of the parameter p. Hence, the right-hand terms of Equation 15 are viewed as a function of parameter p. Therefore, f(y | p) is viewed as the likelihood of p rather than the probability mass of y.

The term f(p) is the prior distribution and reflects the experimenter's a priori beliefs about the true value of p.

Finally, the term f(y) is the distribution of the data given the model. Although its interpretation is important in some contexts, it plays a minimal role in the development presented here, for the following reason.


The posterior is a function of the parameter of interest p, but f(y) does not depend on p. The term f(y) is a constant with respect to p and serves as a normalizing constant that ensures that the posterior density integrates to 1. As will be shown in the examples, the value of this normalizing constant usually becomes apparent in analysis.

Figure 4. Sampling distributions of p̂0, p̂1, and p̂2 for N = 10 trials with a p = .7 true probability of success on any one trial. Bias and root mean squared error (RMSE) are included.

To perform Bayesian analysis, the researcher must choose a prior distribution. For this application of estimating a probability from binomially distributed data, the beta distribution serves as a suitable prior distribution for p. The beta is a flexible two-parameter distribution; see Figure 6 for examples. In contrast to the normal distribution, which has nonzero density on all numbers, the beta has nonzero density between 0 and 1. The beta distribution is a function of two parameters, a and b, that determine its shape. The notation for specifying a beta distribution prior for p is

p ~ Beta(a, b).

The density of a beta random variable is given by

f(p) = p^(a−1) (1 − p)^(b−1) / Be(a, b).   (18)

The denominator, Be(a, b), is termed the beta function.¹ Like f(y), it is not a function of p and plays the role of a normalizing constant in analysis.

In practice, researchers must completely specify the prior distribution before analysis; that is, they must choose suitable values for a and b. This choice reflects researchers' beliefs about the possible values of p before data are collected. When a = 1 and b = 1 (see the middle panel of Figure 6), the beta distribution is flat; that is, there is equal density for all values in the interval [0, 1]. By choosing a = b = 1, the researcher is committing to all values of p being equally likely before data are collected. Researchers can choose values of a and b to best match their beliefs about the expected data. For example, consider a psychophysical experiment in which the value of p will almost surely be greater than .5. For this experiment, priors with a > b are appropriate.

The goal is to derive the posterior distribution using Bayes's theorem (Equation 15). Substituting the likelihood function (Equation 17) and prior (Equation 18) into Bayes's theorem yields

f(p | y) = [(N choose y) p^y (1 − p)^(N−y)] [p^(a−1) (1 − p)^(b−1)] / [Be(a, b) f(y)].

Collecting terms in p yields

f(p | y) = k p^(y+a−1) (1 − p)^(N−y+b−1),   (19)

where

k = [Be(a, b) f(y)]⁻¹ (N choose y).

The term k is constant with respect to p and serves as a normalizing constant. Substituting

a′ = y + a,  b′ = N − y + b   (20)

into Equation 19 yields

f(p | y) = k p^(a′−1) (1 − p)^(b′−1).

This equation is proportional to a beta density for parameters a′ and b′. To make f(p | y) a proper probability density, k has a value such that f(p | y) integrates to 1.

Figure 5. Bias and root mean squared error (RMSE) for the three estimators as functions of true probability of success. The solid, dashed, and dashed–dotted lines denote the characteristics of p̂0, p̂1, and p̂2, respectively.

Figure 6. Probability density function of the beta distribution for various values of parameters a and b.


This occurs only if the value of k is a beta function; for example, k = 1/Be(a′, b′). Consequently,

f(p | y) = p^(a′−1) (1 − p)^(b′−1) / Be(a′, b′).   (21)

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by separating those terms that depend on the parameter of interest from those that do not. The latter terms serve as a normalizing constant and can be computed by ensuring that the posterior integrates to 1.0. In most derivations, the value of the normalizing constant becomes apparent, much as it did above. Consequently, it is convenient to consider only those terms that depend on the parameter of interest and to lump the rest into a proportionality constant. Hence, a convenient form of Bayes's theorem is

Posterior distribution ∝ Likelihood function × Prior distribution.

The symbol "∝" denotes proportionality and is read is proportional to. This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1, and the observed data are 7 successes in 10 trials. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably more narrow than the prior, indicating that a large degree of information has been gained from the data. There are two valuable quantities derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.

For the case of a beta-distributed posterior, the expression for the posterior mean is

p̂ = a′ / (a′ + b′) = (y + a) / (N + a + b).

The posterior mean reduces to p̂1 if a = b = .5; it reduces to p̂2 if a = b = 1. Therefore, estimators p̂1 and p̂2 are theoretically justified from a Bayesian perspective. The estimator p̂0 may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper. Conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper. In some cases, an improper prior will still yield a proper posterior. When this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.
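These closed-form updates are easy to verify numerically. The R lines below (our illustration, using the y = 7, N = 10 example above) compute the posterior parameters from Equation 20, the posterior mean, and an equal-tailed 95% credible interval:

a <- 1; b <- 1                   # flat Beta(1,1) prior
y <- 7; N <- 10                  # observed successes and trials
ap <- y + a; bp <- N - y + b     # Equation 20: a' = 8, b' = 4
ap / (ap + bp)                   # posterior mean, (y+a)/(N+a+b) = 2/3
qbeta(c(.025, .975), ap, bp)     # equal-tailed 95% credible interval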

When the prior and posterior are from the same distribution, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data, because the resulting posterior is also a beta distribution. Conjugate priors are convenient; they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion about conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model: We will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations w = (w_1, …, w_N) are independent and identically distributed as normals:

w_j ~ Normal(μ, σ²), independently for j = 1, …, N.

The goal is to estimate parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²
Assume that σ² is known and that the estimation of μ is of primary interest. The goal, then, is to derive the posterior f(μ | σ², w).


Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


According to the proportional form of Bayes's theorem,

f(μ | σ², w) ∝ f(w | μ, σ²) f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

μ ~ Normal(μ_0, σ_0²).

Parameters μ_0 and σ_0² must be specified before analysis, according to the researcher's beliefs about μ. By choosing σ_0² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ.

The density of the prior is given by

f(μ) = (2πσ_0²)^(−1/2) exp[−(μ − μ_0)² / (2σ_0²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

f(w | μ, σ²) = (2πσ²)^(−N/2) exp[−Σ_j (w_j − μ)² / (2σ²)].   (22)

Multiplying the equation above by the prior yields

f(μ | σ², w) ∝ exp[−Σ_j (w_j − μ)² / (2σ²)] × exp[−(μ − μ_0)² / (2σ_0²)].

We concern ourselves with only those terms that involve μ, the parameter of interest. Other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

f(μ | σ², w) ∝ exp[−(μ − μ′)² / (2σ²′)],

where

μ′ = σ²′ (Σ_j w_j / σ² + μ_0 / σ_0²)   (23)

and

σ²′ = (N/σ² + 1/σ_0²)⁻¹.   (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ²′, respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ²′:

μ | σ², w ~ Normal(μ′, σ²′).   (25)

Unfortunately, the notation does not make it clear that both μ′ and σ²′ depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ²′(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ²′ = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior


is constructed to be fairly informative, with μ_0 = 100 and σ_0² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.

Estimating σ² With Known μ
In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

f(σ²) = [b^a / Γ(a)] (σ²)^(−(a+1)) exp(−b/σ²),  σ² ≥ 0.

The parameters of the inverse gamma are a and b, and in application these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

f(σ² | μ, w) ∝ (σ²)^(−N/2) exp[−Σ_j (w_j − μ)² / (2σ²)] × (σ²)^(−(a+1)) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

f(σ² | μ, w) ∝ (σ²)^(−(a′+1)) exp(−b′/σ²),

where

a′ = N/2 + a   (26)

and

b′ = Σ_j (w_j − μ)² / 2 + b.   (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

σ² | μ, w ~ Inverse Gamma(a′, b′).   (28)

Note that posterior parameter b′ depends explicitly on the value of μ.

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem: The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors, μ | w and σ² | w.


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


The straightforward method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis–Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis. With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]_1 denotes the first sample, [η]_2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]_1 ~ Normal(μ + [η]_1, σ²)

and

[x]_2 ~ Normal(μ + [η]_2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
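This digression is easy to reproduce in R; the parameter values below are our own illustrative choices:

# Monte Carlo integration: ex-Gaussian samples via the conditional form
mu <- 0.4; sigma <- 0.05; tau <- 0.2          # assumed parameter values
eta <- rexp(1e5, rate = 1/tau)                # [eta]: samples from Exponential(tau)
x <- rnorm(1e5, mean = mu + eta, sd = sigma)  # [x]: samples of x | eta
hist(x, breaks = 100, freq = FALSE)           # approximates the ex-Gaussian density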

We now return to the problem of deriving marginal posterior distributions of μ | w from the conditional μ | σ², w. If there was a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there was a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]_1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]_1 ~ Normal(μ′([σ²]_1), σ²′([σ²]_1)),

where μ′ and σ²′ are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]_1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]_m ~ Normal(μ′([σ²]_{m−1}), σ²′([σ²]_{m−1}))   (29)

and

[σ²]_m ~ Inverse Gamma(a′, b′([μ]_m)).   (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.
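A compact R implementation of Equations 29 and 30 is sketched below for the IQ example that follows (the data w and the diffuse prior settings are taken from that example; the chain length and starting value are our choices):

# Gibbs sampler for mu and sigma2 with normally distributed data
w <- c(112, 106, 104, 111); N <- length(w)
mu0 <- 0; sigma02 <- 1e6         # diffuse normal prior on mu
a <- 1e-6; b <- 1e-6             # diffuse inverse-gamma prior on sigma2
M <- 10000
mu <- sigma2 <- numeric(M)
sigma2[1] <- 1                   # arbitrary starting value [sigma2]_1
for (m in 2:M) {
  v  <- 1/(N/sigma2[m-1] + 1/sigma02)           # Equation 24
  mn <- v * (sum(w)/sigma2[m-1] + mu0/sigma02)  # Equation 23
  mu[m] <- rnorm(1, mn, sqrt(v))                # Equation 29
  sigma2[m] <- 1/rgamma(1, shape = N/2 + a,     # Equation 30, via Equations 26-27
                        rate = sum((w - mu[m])^2)/2 + b)
}
mean(mu[101:M])   # posterior mean of mu after a burn-in of 100 samples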

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]_1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994).


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


The initial region in which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m_0 denote the point at which the samples are approximately steady state. Samples after m_0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σ_{m>m_0} [μ]_m / (M − m_0)

and

σ̂² = Σ_{m>m_0} [σ²]_m / (M − m_0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m_0 to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution. Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (s_w̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ_0 = 0, σ_0² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value of [σ²]_1. As can be seen, both chains converge to their steady state rather quickly. For this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.


Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]_1 = 10,000 and [σ²]_1 = .001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ_0 = 100 and σ_0² = 15² = 225. We may not know the particular test–retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test–retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]_1 = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another. The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly, and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that, for the normal application, there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
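In R, this informal check takes one line per diagnostic; mu here is the chain produced by the Gibbs sampler sketched earlier (the variable name and burn-in are ours):

plot(mu[101:10000], type = "l")   # trace plot of the chain after burn-in
acf(mu[101:10000])                # autocorrelation function of the chain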

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the inverse of the normal cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z) and, conversely, z = Φ⁻¹(p).

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p(h)) − Φ⁻¹(p(f))

and

c = −Φ⁻¹(p(f)),

where p(h) and p(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.
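As a concrete illustration, the two definitions above are one line each in R, with qnorm() playing the role of Φ⁻¹. The function below is a hypothetical helper of ours (not from the article's appendices), applied to counts of hits and false alarms:

## conventional signal detection estimates from counts
## S = number of signal trials, N = number of noise trials
sdt.est <- function(hits, fas, S, N) {
  h <- qnorm(hits / S)             # probit of the hit rate
  f <- qnorm(fas / N)              # probit of the false alarm rate
  list(dprime = h - f, c = -f)
}
sdt.est(hits = 35, fas = 7, S = 50, N = 50)   # d' of about 1.6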

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ₀, σ₀²).  (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case in which a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta, with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) × exp[−(z − μ₀)² / (2σ₀²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).
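In R, this inverse transform is a single call to pnorm(). A minimal sketch, assuming z.samples holds the post-burn-in samples of z (the name is ours):

p.samples <- pnorm(z.samples)          # [p]_m = Phi([z]_m), sample by sample
hist(p.samples, freq = FALSE, breaks = 50)
curve(dbeta(x, 8, 4), add = TRUE)      # expected posterior for y = 7, N = 10 with a flat prior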

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ₀ = 0 and σ₀² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with parameters a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]₁, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period.

Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.


Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i / N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success.

Figure 13. Three normally distributed priors on z [from left to right: N(0,1), N(1,1), and N(0,2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal; priors on p were obtained by transforming each sample of z.


The ith participant's true ability, z_i, is simply a sample from this normal. The number of successes is modeled as

y_i | z_i ~ind Binomial[N_i, Φ(z_i)],  (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ind Normal(μ, σ²).  (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ₀, σ₀²)  (34)

and

σ² ~ Inverse Gamma(a, b).  (35)

Values of μ₀, σ₀², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities, which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂₀ (Equation 12) and p̂₂ (Equation 14) are shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.
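For readers who want a self-contained starting point before turning to the appendices, the following R sketch implements one plausible version of the sampler for Equations 32-35. It is our minimal reconstruction, not the code of Appendix C: The data vector reproduces Table 2, the truncated-normal draws implement the Albert and Chib (1995) latent-variable step, and the prior settings follow the values reported below.

set.seed(1)
## hypothetical data: successes in 50 trials for 20 participants (cf. Table 2)
y <- c(33, 28, 34, 29, 32, 37, 30, 29, 34, 34,
       34, 37, 38, 38, 37, 33, 37, 33, 38, 42)
I <- length(y); N <- rep(50, I)

## second-stage priors: mu ~ Normal(mu0, s0sq); sigma^2 ~ Inverse Gamma(a, b)
mu0 <- 0; s0sq <- 1000; a <- 1; b <- 1

M <- 10000                                    # chain length
z <- matrix(0, M, I); mu <- s2 <- numeric(M)
mu[1] <- 0; s2[1] <- 1

## unit-variance normal truncated below 0 (pos = TRUE) or above 0 (pos = FALSE)
rtnorm0 <- function(n, mean, pos) {
  p0 <- pnorm(0, mean)
  u <- if (pos) runif(n, p0, 1) else runif(n, 0, p0)
  qnorm(u, mean)
}

for (m in 2:M) {
  ## 1. latent data w (Albert & Chib, 1995): positive for successes, negative otherwise
  sw <- numeric(I)
  for (i in 1:I) {
    sw[i] <- sum(rtnorm0(y[i], z[m - 1, i], pos = TRUE)) +
             sum(rtnorm0(N[i] - y[i], z[m - 1, i], pos = FALSE))
  }
  ## 2. participant abilities z_i | w, mu, sigma^2 (Equations 32-33)
  v <- 1 / (N + 1 / s2[m - 1])
  z[m, ] <- rnorm(I, v * (sw + mu[m - 1] / s2[m - 1]), sqrt(v))
  ## 3. parent mean mu | z, sigma^2 (Equation 34)
  v0 <- 1 / (I / s2[m - 1] + 1 / s0sq)
  mu[m] <- rnorm(1, v0 * (sum(z[m, ]) / s2[m - 1] + mu0 / s0sq), sqrt(v0))
  ## 4. parent variance sigma^2 | z, mu (Equation 35; conjugate inverse gamma)
  s2[m] <- 1 / rgamma(1, shape = a + I / 2, rate = b + sum((z[m, ] - mu[m])^2) / 2)
}

p.hat <- colMeans(pnorm(z[-(1:100), ]))       # posterior means after burn-in

Each cycle samples every conditional once; the shrinkage discussed below emerges because each z_i is pulled toward the current parent mean μ.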

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ₀ = 0, σ₀² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16; the autocorrelation in these plots was typical of that for other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂₀ (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂₀. The top right panel shows the hierarchical estimates as a function of p̂₀. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂₂ also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂₀. For the data of Table 2, the difference in efficiency between p̂₂ and p̂₀ is exceedingly small (about .0001). The reason that there was little gain for p̂₂ is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂₂ and p̂₀ have about the same efficiency. The hierarchical estimator is superior


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


to these nonhierarchical estimators because it pulls outlying estimates toward the mean value, p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which all signal trials are identical replicates of one another, as are all noise trials.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i(h)) − Φ⁻¹(p_i(f))

and

c_i = −Φ⁻¹(p_i(f)),

where p_i(h) and p_i(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms, h_i = Φ⁻¹(p_i(h)) and f_i = Φ⁻¹(p_i(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant, and let y_i(h) and y_i(f) denote the number of hits and false alarms, respectively. The model for y_i(h) and y_i(f) is given by

y_i(h) | h_i ~ind Binomial[S_i, Φ(h_i)]

and

y_i(f) | f_i ~ind Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ind Normal(μ_h, σ_h²)  (36)

and

f_i ~ind Normal(μ_f, σ_f²).  (37)

Second-stage priors are placed on the parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k),  k = h, f,  (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k),  k = h, f.  (39)


Table 2
Data and Estimates of a Probability of Success

Participant   True Prob.   Data   p̂₀    p̂₂     Hierarchical
 1            .587         33     .66    .654    .672
 2            .621         28     .56    .558    .615
 3            .633         34     .68    .673    .683
 4            .650         29     .58    .577    .627
 5            .659         32     .64    .635    .660
 6            .665         37     .74    .731    .716
 7            .667         30     .60    .596    .638
 8            .667         29     .58    .577    .627
 9            .687         34     .68    .673    .684
10            .693         34     .68    .673    .682
11            .696         34     .68    .673    .683
12            .701         37     .74    .731    .715
13            .713         38     .76    .750    .728
14            .734         38     .76    .750    .727
15            .736         37     .74    .731    .717
16            .738         33     .66    .654    .672
17            .743         37     .74    .731    .716
18            .759         33     .66    .654    .672
19            .786         38     .76    .750    .728
20            .803         42     .84    .827    .771
RMSE                              .053   .053    .041

Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples of h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished. For all samples m,

[d′_i]_m = [h_i]_m − [f_i]_m  (40)

and

[c_i]_m = −[f_i]_m.  (41)
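Because Equations 40 and 41 are linear, they apply sample by sample to whole chains. A sketch in R, assuming h.samples and f.samples are matrices of post-burn-in samples (rows are MCMC cycles, columns are participants; the names are ours):

d.samples <- h.samples - f.samples    # Equation 40, applied to every sample
c.samples <- -f.samples               # Equation 41
colMeans(d.samples)                   # posterior mean sensitivity for each participant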

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and a normal distribution, respectively. True d′ values are shown; they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = 0.1, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation of sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information; the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17: The extreme estimates in the conventional method, which most likely


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂₀, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂₀. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants (Participants 5 and 15) from the hierarchical model. The degree of autocorrelation is typical of the other participants.


reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. We show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme, blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension; then, we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij(h) and y_ij(f) denote the response of the ith person to the jth item. Values of y_ij(h) and y_ij(f) are dichotomous, indicating whether the response was old or new. Let p_ij(h) and p_ij(f) be the true probabilities that y_ij(h) = 1 and y_ij(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij(h)), f_ij = Φ⁻¹(p_ij(f))]. The model is given by

y_ij(h) | h_ij ~ind Bernoulli[Φ(h_ij)]  (42)

and

y_ij(f) | f_ij ~ind Bernoulli[Φ(f_ij)].  (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j  (44)

and

f_ij = μ_f + β_i + δ_j.  (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f  (46)

and

c = −μ_f.  (47)
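To make Equations 42-45 concrete, the following R sketch generates one hypothetical data set from the model. The grand means and random-effect standard deviations are values we assume for illustration; they are not the ones used in the article's simulations:

set.seed(2)
I <- 20; J <- 50                       # participants and items
mu.h <- 1; mu.f <- -1                  # assumed grand means (so d' = 2, c = 1)
alpha <- rnorm(I, 0, .3); gamma <- rnorm(J, 0, .3)   # effects on hits
beta  <- rnorm(I, 0, .3); delta <- rnorm(J, 0, .3)   # effects on false alarms
h <- mu.h + outer(alpha, gamma, "+")   # h[i, j] = mu.h + alpha_i + gamma_j (Equation 44)
f <- mu.f + outer(beta, delta, "+")    # f[i, j] = mu.f + beta_i + delta_j (Equation 45)
y.h <- matrix(rbinom(I * J, 1, pnorm(h)), I, J)   # Bernoulli hits (Equation 42)
y.f <- matrix(rbinom(I * J, 1, pnorm(f)), I, J)   # Bernoulli false alarms (Equation 43)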


Table 3
Data and Estimates of Sensitivity

Participant   True d′   Hits   False Alarms   Hierarchical   Conventional
 1            1.238     35      7             1.608          1.605
 2            1.239     38     17             1.307          1.119
 3            1.325     38     11             1.554          1.478
 4            1.330     33     16             1.146           .880
 5            1.342     40     19             1.324          1.147
 6            1.405     39      8             1.730          1.767
 7            1.424     37     12             1.466          1.350
 8            1.460     27      5             1.417          1.382
 9            1.559     35      9             1.507          1.440
10            1.584     37      9             1.599          1.559
11            1.591     33      8             1.486          1.407
12            1.624     40      7             1.817          1.922
13            1.634     35     11             1.427          1.297
14            1.782     43     13             1.687          1.724
15            1.894     33      3             1.749          1.967
16            1.962     45     12             1.839          1.988
17            1.967     45      4             2.235          2.687
18            2.124     39      6             1.826          1.947
19            2.270     45      5             2.172          2.563
20            2.337     46      6             2.186          2.580
RMSE                                           .183           .275

Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i  (48)

and

c_i = −(μ_f + β_i).  (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j  (50)

and

c_j = −(μ_f + δ_j).  (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²),  (52)

β_i ~ind Normal(0, σ_β²),  (53)

γ_j ~ind Normal(0, σ_γ²),  (54)

and

δ_j ~ind Normal(0, σ_δ²).  (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = 0.1, b = 0.1). The model was evaluated with the same simulation used in the introduction (Equations 8-11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants (Participants 5 and 15).


effect is not assumed in model analysis; participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis: If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations, but instead skip a fixed number (for example, discarding all but every 50th observation). The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation: Chains with larger amounts of autocorrelation implicate larger multiples for thinning.

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
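The Cholesky approach mentioned above takes only a few lines in base R. This is a generic sketch of the technique, not the article's posted code; rmvnorm.chol is our own name:

## draw n samples from a multivariate normal via the Cholesky decomposition
rmvnorm.chol <- function(n, mean, Sigma) {
  R <- chol(Sigma)                          # upper triangular; t(R) %*% R = Sigma
  Z <- matrix(rnorm(n * length(mean)), n)   # i.i.d. standard normals, one row per draw
  sweep(Z %*% R, 2, mean, "+")              # rows are draws from Normal(mean, Sigma)
}

rmvnorm.chol(5, mean = c(0, 0), Sigma = diag(2))   # five bivariate standard normal draws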

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8-11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of 27% in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect that over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic



improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant; the y-axis value of the point is α_i, the participant's hit rate effect, and the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.

Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21. These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a

Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.

Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior under which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ind Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ind Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
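To illustrate what such a correlated prior expresses, the sketch below draws hypothetical participant effects (α_i, β_i) from a bivariate normal with a negative covariance (a mirror-effect pattern), reusing the rmvnorm.chol helper sketched earlier; all numerical values are our assumptions:

Sigma.i <- matrix(c( .09, -.045,
                    -.045,  .09), 2, 2)         # implies a correlation of -.5
ab <- rmvnorm.chol(20, mean = c(0, 0), Sigma = Sigma.i)
cor(ab[, 1], ab[, 2])                           # roughly -.5, noisy with only 20 draws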

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be an increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model with which to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, the variance may then be estimated simultaneously along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on the variance is 1/σ².
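The binomial case can be made concrete with one line of algebra, using the standard conjugacy result consistent with the beta development earlier in the article: With a Beta(a, b) prior and y successes in N trials, the posterior is Beta(a + y, b + N − y), so the posterior mean is

(a + y) / (a + b + N),

which reduces to the sample proportion y/N as a = b → 0.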

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors used here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of the data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times, because the impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: The value of an estimate for any participant depends, in part, on the values for others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than that of Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. L., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let xj be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable wj is defined as follows:

    xj = 1 if wj ≥ 0;  xj = 0 if wj < 0.    (A1)

Latent variables w are never known; all that is known is their sign. If trial j was successful, wj was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable wj is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p).^A1 Therefore,

    wj ~ Normal(z, 1), independently across trials.

This construction of wj is shown in Figure A1a. The parameters of interest are z and w = (w1, …, wJ); the data are (x1, …, xJ). Note that if the latent variables w were known, the problem of estimating z would be simple: parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and wj | z, xj, for j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ0 and σ0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

    μ′ = (J + 1/σ0²)⁻¹ (Σj wj + μ0/σ0²)    (A2)

and

    σ²′ = (J + 1/σ0²)⁻¹.    (A3)

The conditional wj | z, xj is only slightly more complex. If xj is unknown, latent variable wj is normally distributed and centered at z. When wj is conditioned on xj, the sign of wj is known. If xj = 0, then by Equation A1, wj must be negative; hence, the conditional wj | z, xj is a normal density that is truncated above at 0 (Figure A1b). Likewise, if xj = 1, then wj must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.


Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from a normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.

NOTES

1. The beta function is given by

    Be(a, b) = Γ(a)Γ(b) / Γ(a + b),

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)! for integer n.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 35 and b = 15.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let xij be the indicator for the ith person on the jth trial: xij = 1 if the trial is successful and 0 otherwise. Latent variable wij is constructed as

    xij = 1 if wij ≥ 0;  xij = 0 if wij < 0.    (B1)

Conditional posterior wij | zi, xij is a truncated normal, as was discussed in Appendix A. Conditional posterior zi | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ0 and σ² for σ0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable wj for p = .84. (b) Conditional distribution of wj on a failure on trial j; in this case, wj must be negative. (c) Conditional distribution of wj on a success on trial j; in this case, wj must be positive.

NOTE TO APPENDIX A

A1. Let μw be the mean of w. Then, by construction,

    Pr(wj < 0) = Pr(xj = 0) = 1 − p.

Random variable wj is a normal with variance of 1.0. The term Pr(wj < 0) is its cumulative distribution function evaluated at 0. Therefore,

    Pr(wj < 0) = Φ(0 − μw) = Φ(−μw).

Combining this equality with the one above yields

    Φ(−μw) = 1 − p,

and, by symmetry of the normal,

    μw = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# -------------------------------------------
# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b,1) truncated below zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a normal(b,1) truncated above zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# generate true values and data
# -----------------------------
I = 20  # number of participants
J = 50  # number of trials per participant
# generate each individual's true p
p = rbeta(I, 35, 15)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# Analysis set up
# ---------------
M = 10000        # total number of MCMC iterations
myrange = 101:M  # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z = matrix(ncol = I, nrow = M)  # each person's z at each iteration;
                                # note z is probit of p, e.g., z = qnorm(p)
sigma2 = 1:M
mu = 1:M
# prior parameters
sigma0 = 100  # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1
# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)
# shape for inverse gamma, calculated outside of loop for speed
shape = I/2 + a

# --------------
# Main MCMC loop
# --------------
for (m in 2:M) {    # iteration
  for (i in 1:I) {  # participant
    # samples of w | y, z
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))
    prec = J + 1/sigma2[m-1]
    meanz = (sum(w) + mu[m-1]/sigma2[m-1])/prec
    z[m, i] = rnorm(1, meanz, sqrt(1/prec))  # sample of z | w, mu, sigma2
  }
  prec = I/sigma2[m-1] + 1/sigma0
  meanmu = (sum(z[m, ])/sigma2[m-1] + mu0/sigma0)/prec
  mu[m] = rnorm(1, meanmu, sqrt(1/prec))  # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2)/2 + b
  sigma2[m] = 1/rgamma(1, shape = shape, scale = 1/rate)  # sample of sigma2 | z
}

# -----------------------------------------
# Gather estimates of p for each individual
# -----------------------------------------
allpest = pnorm(z[myrange, ])    # MCMC chains of p
pesth = apply(allpest, 2, mean)  # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



ment presented here for the following reason: The posterior is a function of the parameter of interest, p, but f(y) does not depend on p. The term f(y) is a constant with respect to p and serves as a normalizing constant that ensures that the posterior density integrates to 1. As will be shown in the examples, the value of this normalizing constant usually becomes apparent in analysis.

To perform Bayesian analysis, the researcher must choose a prior distribution. For this application of estimating a probability from binomially distributed data, the beta distribution serves as a suitable prior distribution for p. The beta is a flexible two-parameter distribution; see Figure 6 for examples. In contrast to the normal distribution, which has nonzero density on all numbers, the beta has nonzero density only between 0 and 1. The beta distribution is a function of two parameters, a and b, that determine its shape. The notation for specifying a beta distribution prior for p is

    p ~ Beta(a, b).

The density of a beta random variable is given by

    f(p) = p^(a−1) (1 − p)^(b−1) / Be(a, b).    (18)

The denominator, Be(a, b), is termed the beta function.^1 Like f(y), it is not a function of p and plays the role of a normalizing constant in analysis.

In practice, researchers must completely specify the prior distribution before analysis; that is, they must choose suitable values for a and b. This choice reflects researchers' beliefs about the possible values of p before data are collected. When a = 1 and b = 1 (see the middle panel of Figure 6), the beta distribution is flat; that is, there is equal density for all values in the interval [0, 1]. By choosing a = b = 1, the researcher is committing to all values of p being equally likely before data are collected. Researchers can choose values of a and b to best match their beliefs about the expected data. For example, consider a psychophysical experiment in which the value of p will almost surely be greater than .5. For this experiment, priors with a > b are appropriate.

The goal is to derive the posterior distribution using Bayes's theorem (Equation 15). Substituting the likelihood function (Equation 17) and prior (Equation 18) into Bayes's theorem yields

    f(p | y) = [ (N choose y) p^y (1 − p)^(N−y) ] × [ p^(a−1) (1 − p)^(b−1) / Be(a, b) ] / f(y).

Collecting terms in p yields

    f(p | y) = k p^(y+a−1) (1 − p)^(N−y+b−1),    (19)

where

    k = (N choose y) × [Be(a, b) f(y)]⁻¹.

The term k is constant with respect to p and serves as a normalizing constant. Substituting

    a′ = y + a,  b′ = N − y + b    (20)

into Equation 19 yields

    f(p | y) = k p^(a′−1) (1 − p)^(b′−1).

This equation is proportional to a beta density with parameters a′ and b′. To make f(p | y) a proper probability density, k has a value such that f(p | y) integrates to 1.


Figure 5. Bias and root mean squared error (RMSE) for the three estimators as functions of true probability of success. The solid, dashed, and dashed-dotted lines denote the characteristics of p̂0, p̂1, and p̂2, respectively.

Figure 6. Probability density function of the beta distribution for various values of parameters a and b.


This occurs only if the value of k is a beta function, for example, k = 1/Be(a′, b′). Consequently,

    f(p | y) = p^(a′−1) (1 − p)^(b′−1) / Be(a′, b′).    (21)

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by separating those terms that depend on the parameter of interest from those that do not. The latter terms serve as a normalizing constant and can be computed by ensuring that the posterior integrates to 1.0. In most derivations, the value of the normalizing constant becomes apparent, much as it did above. Consequently, it is convenient to consider only those terms that depend on the parameter of interest and to lump the rest into a proportionality constant. Hence, a convenient form of Bayes's theorem is

    Posterior distribution ∝ Likelihood function × Prior distribution.

The symbol "∝" denotes proportionality and is read "is proportional to." This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably narrower than the prior, indicating that a large degree of information has been gained from the data. Two valuable quantities are derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.

For the case of a beta-distributed posterior, the expression for the posterior mean is

    p̂ = a′ / (a′ + b′) = (y + a) / (N + a + b).

The posterior mean reduces to p̂1 if a = b = .5; it reduces to p̂2 if a = b = 1. Therefore, estimators p̂1 and p̂2 are theoretically justified from a Bayesian perspective. The estimator p̂0 may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper; conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper: in some cases, an improper prior will still yield a proper posterior, and when this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.
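The conjugate update can be checked numerically in R (a minimal sketch, not from the original article, using the y = 7, N = 10 example from Figure 7; the central 95% interval is used here as a simple stand-in for the highest density region):

y = 7; N = 10; a = 1; b = 1            # flat Beta(1,1) prior
aprime = y + a; bprime = N - y + b     # Equation 20: a' = 8, b' = 4
aprime/(aprime + bprime)               # posterior mean = 8/12 = .667
qbeta(c(.025, .975), aprime, bprime)   # central 95% posterior interval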

When the prior and posterior are from the same distribution, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data because the resulting posterior is also a beta distribution. Conjugate priors are convenient: they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion of conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model: we will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations w = (w1, …, wN) are independent and identically distributed as normals:

    wj ~ Normal(μ, σ²).

The goal is to estimate the parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²

Assume that σ² is known and that the estimation of μ is of primary interest. The goal then is to derive the pos-


Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


terior f(μ | σ², w). According to the proportional form of Bayes's theorem,

    f(μ | σ², w) ∝ f(w | μ, σ²) f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

    μ ~ Normal(μ0, σ0²).

Parameters μ0 and σ0² must be specified before analysis, according to the researcher's beliefs about μ. By choosing σ0² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values; this prior is considered noninformative for μ. The density of the prior is given by

    f(μ) = (2πσ0²)^(−1/2) exp[−(μ − μ0)² / (2σ0²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

    f(w | μ, σ²) = (2πσ²)^(−N/2) exp[−Σj (wj − μ)² / (2σ²)].    (22)

Multiplying the equation above by the prior yields

    f(μ | σ², w) ∝ (2πσ²)^(−N/2) exp[−Σj (wj − μ)² / (2σ²)] × (2πσ0²)^(−1/2) exp[−(μ − μ0)² / (2σ0²)].

We concern ourselves with only those terms that involve μ, the parameter of interest; other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

    f(μ | σ², w) ∝ exp[−(μ − μ′)² / (2σ²′)],

where

    μ′ = σ²′ (Σj wj / σ² + μ0 / σ0²)    (23)

and

    σ²′ = (N/σ² + 1/σ0²)⁻¹.    (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ²′, respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ²′:

    μ | σ², w ~ Normal(μ′, σ²′).    (25)

Unfortunately, the notation does not make it clear that both μ′ and σ²′ depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ²′(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ²′ = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior



is constructed to be fairly informative, with μ0 = 100 and σ0² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9, with variance 3.85. Note that the posterior mean is not the sample mean but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.
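The numbers in this example are easy to reproduce in R (a minimal sketch, assuming the values given above):

w = c(112, 106, 104, 111)
sigma2 = 16; mu0 = 100; sigma20 = 100   # known variance and informative prior
v = 1/(length(w)/sigma2 + 1/sigma20)    # Equation 24: posterior variance, 3.85
m = v*(sum(w)/sigma2 + mu0/sigma20)     # Equation 23: posterior mean, 107.9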

Estimating σ² With Known μ

In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal then is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: if a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

    f(σ²) = [b^a / Γ(a)] (σ²)^−(a+1) exp(−b/σ²),  σ² ≥ 0.

The parameters of the inverse gamma are a and b, and in application these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).
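Because the inverse gamma is unfamiliar, it may help to see how samples arise in R (a minimal sketch; the shape and rate below match the density just given, and the values a = 3 and b = 100 anticipate the informative prior used later in this section):

a = 3; b = 100
g = rgamma(10000, shape = a, rate = b)  # g ~ Gamma(a, b)
s2 = 1/g                                # s2 ~ Inverse Gamma(a, b)
mean(s2 < 100)                          # most of the density lies below 100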

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

    f(σ² | μ, w) ∝ (2πσ²)^(−N/2) exp[−Σj (wj − μ)² / (2σ²)] × (σ²)^−(a+1) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

    f(σ² | μ, w) ∝ (σ²)^−(a′+1) exp(−b′/σ²),

where

    a′ = N/2 + a    (26)

and

    b′ = Σj (wj − μ)² / 2 + b.    (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

    σ² | μ, w ~ Inverse Gamma(a′, b′).    (28)

Note that posterior parameter b′ depends explicitly on the value of μ.

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem. The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors μ | w and σ² | w.


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


The straightforward method of obtaining marginals is to integrate conditionals:

    f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis-Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis: with a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

    x | η ~ Normal(μ + η, σ²)

and

    η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]1 denotes the first sample, [η]2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

    [x]1 ~ Normal(μ + [η]1, σ²)

and

    [x]2 ~ Normal(μ + [η]2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: a histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
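This scheme is easy to try in R (a minimal sketch; the parameter values are illustrative and are not from the original figure):

mu = 400; sigma = 50; tau = 100      # illustrative ex-Gaussian parameters
M = 100000
eta = rexp(M, rate = 1/tau)          # [eta], samples from the exponential
x = rnorm(M, mu + eta, sigma)        # [x], samples of x | eta
hist(x, breaks = 100, freq = FALSE)  # histogram converges to the ex-Gaussian density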

We now return to the problem of deriving the marginal posterior μ | w from the conditional μ | σ², w. If there were a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there were a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

    [μ]1 ~ Normal(μ′([σ²]1), σ²′([σ²]1)),

where μ′ and σ²′ are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

    [μ]m ~ Normal(μ′([σ²]_(m−1)), σ²′([σ²]_(m−1)))    (29)

and

    [σ²]m ~ Inverse Gamma(a′, b′([μ]m)).    (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994). The initial region in


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


which the samples reflect the starting value is termed the burn-in period; the later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m0 denote the point at which the samples are approximately steady state. Samples after m0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

    μ̂ = Σ_(m>m0) [μ]m / (M − m0)

and

    σ̂² = Σ_(m>m0) [σ²]m / (M − m0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m0 to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution; Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.
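Putting the pieces together, the sampler in Equations 29 and 30 takes only a few lines of R (a minimal sketch, using the IQ data and diffuse priors described next):

w = c(112, 106, 104, 111); N = length(w)
mu0 = 0; sigma20 = 1e6; a = 1e-6; b = 1e-6  # diffuse priors
M = 10000
mu = numeric(M); sigma2 = numeric(M)
sigma2[1] = 1                               # arbitrary starting value
for (m in 2:M) {
  v = 1/(N/sigma2[m-1] + 1/sigma20)         # Equation 24
  mu[m] = rnorm(1, v*(sum(w)/sigma2[m-1] + mu0/sigma20), sqrt(v))  # Equation 29
  rate = sum((w - mu[m])^2)/2 + b           # Equation 27
  sigma2[m] = 1/rgamma(1, shape = N/2 + a, rate = rate)            # Equation 30
}
mean(mu[101:M])                             # posterior mean after burn-in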

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (sw/√N = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ0 = 0, σ0² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value of [σ²]1. As can be seen, both chains converge to their steady state rather quickly; for this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.


Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]1 = 10,000 and [σ²]1 = 0.001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ0 = 100 and σ0² = 15² = 225. We may not know the particular test-retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11; it has much of its density below 100, reflecting our belief that the test-retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]1 = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: people on average score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is narrower in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another: the value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that for the normal application there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
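In R, this check is one line (a minimal sketch, assuming a chain mu from the sampler above and a burn-in of 100 iterations):

acf(mu[101:M])  # autocorrelation function of the post-burn-in samples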

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z), and conversely, z = Φ⁻¹(p).
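In R, the probit transform and its inverse are the standard normal quantile and distribution functions (a minimal sketch reproducing the values marked in Figure 12):

z = qnorm(.69)  # probit: z = Phi^(-1)(.69), approximately 0.5
pnorm(z)        # inverse transform recovers p = .69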

Signal detection parameters are defined in terms of probit transforms:

    d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

    c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

    y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

    z ~ Normal(μ0, σ0²).    (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case when a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior, f(z | y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

    f(z | y) ∝ Φ(z)^y [1 − Φ(z)]^(N−y) × exp[−(z − μ0)² / (2σ0²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined, upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC: because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]m = Φ([z]m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ0 = 0 and σ0² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with parameters a′ = 1 + y = 8 and b′ = 1 + (N − y) = 4. The chains in the Gibbs sampler were started from a number of values of [z]1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.


1,000 samples with 100 samples of burn-in are more than sufficient for this application.
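The entire sampler can be written compactly in R (a minimal sketch, reusing the truncated-normal functions rtnpos and rtnneg defined in Appendix C; y = 7 successes in N = 10 trials with a standard normal prior on z):

y = 7; N = 10; mu0 = 0; sigma20 = 1
M = 1000
z = numeric(M)
for (m in 2:M) {
  w = c(rtnpos(y, z[m-1]), rtnneg(N - y, z[m-1]))     # w | z, x: truncated normals
  v = 1/(N + 1/sigma20)                               # Equation A3
  z[m] = rnorm(1, v*(sum(w) + mu0/sigma20), sqrt(v))  # Equation A2
}
p = pnorm(z[101:M])                                   # chain of posterior samples of p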

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, pi. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂i = yi/Ni, where yi and Ni are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior: a normally distributed parent may be placed on the transformed probabilities of success.

Figure 13. Three normally distributed priors on z (left to right: N(0,1), N(1,1), and N(0,2)) and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.

The ith participant's true ability, zi, is simply a sample from this normal. The number of successes is modeled as

    yi | zi ~ Binomial[Ni, Φ(zi)],    (32)

with a parent distribution (first-stage prior) given by

    zi | μ, σ² ~ Normal(μ, σ²).    (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

    μ ~ Normal(μ0, σ0²)    (34)

and

    σ² ~ Inverse Gamma(a, b).    (35)

Values of μ0, σ0², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities,^2 which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂0 (Equation 12) and p̂2 (Equation 14) are also shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ0 = 0, σ0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of pi for 2 select participants is shown in the bottom panels of Figure 16; the autocorrelation in these plots was typical of the other parameters and no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂0. The top right panel shows the hierarchical estimates as a function of p̂0. The effect of the hierarchical structure is clear: the extreme estimates in the frequentist approach are moderated in the Bayesian analysis.
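The shrinkage visible in the top right panel of Figure 16 can be reproduced from the Appendix C script (a minimal sketch, assuming the objects y, J, and pesth created there are in the workspace):

p0 = y/J                       # conventional estimates
plot(p0, pesth); abline(0, 1)  # extreme values of p0 are pulled toward the mean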

Estimator p̂2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂0. For the data of Table 2, the difference in efficiency between p̂2 and p̂0 is exceedingly small (about .001). The reason that there was little gain for p̂2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂2 and p̂0 have about the same efficiency. The hierarchical estimator is superior


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


to these nonhierarchical estimators because it pulls outlying estimates toward the mean value, p̄. Of course, the advantage of the hierarchical model is a function of sample size: with increasing sample sizes, the amount of pulling decreases, and in the limit, all three estimators converge to the true individual probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′i and ci denote signal detection parameters for the ith participant:

    d′i = Φ⁻¹(pi^(h)) − Φ⁻¹(pi^(f))

and

    ci = −Φ⁻¹(pi^(f)),

where pi^(h) and pi^(f) are individuals' true hit and false alarm probabilities, respectively. Let hi and fi be the probit transforms hi = Φ⁻¹(pi^(h)) and fi = Φ⁻¹(pi^(f)). Then sensitivity and criteria are given by

    d′i = hi − fi

and

    ci = −fi.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of hi and fi results in accurate estimation of d′i and ci.

The model is developed as follows. Let Ni and Si denote the number of noise and signal trials given to the ith participant. Let yi^(h) and yi^(f) denote the number of hits and false alarms, respectively. The model for yi^(h) and yi^(f) is given by

    yi^(h) | hi ~ Binomial[Si, Φ(hi)]

and

    yi^(f) | fi ~ Binomial[Ni, Φ(fi)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

    hi ~ Normal(μh, σh²)    (36)

and

    fi ~ Normal(μf, σf²).    (37)

Second-stage priors are placed on the parent distribution parameters (μk, σk²) as

    μk ~ Normal(uk, vk),  k = h, f    (38)

and

    σk² ~ Inverse Gamma(ak, bk),  k = h, f.    (39)


Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p̂0    p̂2    Hierarchical
 1           .587        33    .66   .654  .672
 2           .621        28    .56   .558  .615
 3           .633        34    .68   .673  .683
 4           .650        29    .58   .577  .627
 5           .659        32    .64   .635  .660
 6           .665        37    .74   .731  .716
 7           .667        30    .60   .596  .638
 8           .667        29    .58   .577  .627
 9           .687        34    .68   .673  .684
10           .693        34    .68   .673  .682
11           .696        34    .68   .673  .683
12           .701        37    .74   .731  .715
13           .713        38    .76   .750  .728
14           .734        38    .76   .750  .727
15           .736        37    .74   .731  .717
16           .738        33    .66   .654  .672
17           .743        37    .74   .731  .716
18           .759        33    .66   .654  .672
19           .786        38    .76   .750  .728
20           .803        42    .84   .827  .771
RMSE                           .053  .053  .041

Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples of hi and fi for each participant. These are probit-transformed parameters that may be estimated following the exact procedure of the previous section. Separate runs are made for signal trials (producing estimates of hi) and for noise trials (producing estimates of fi); these posterior samples are denoted [hi] and [fi], respectively. The second step is to use these posterior samples to estimate posterior distributions of the individual d′i and ci. This is easily accomplished: for all samples m,

    [d′i]m = [hi]m − [fi]m    (40)

and

    [ci]m = −[fi]m.    (41)
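In R, the two steps amount to running the probit sampler twice and differencing the chains (a minimal sketch; hierProbit is a hypothetical wrapper around the Appendix C sampler that returns an M × I matrix of posterior samples of probit-transformed probabilities given success counts and trial counts, and yh, yf, S, and N are assumed data objects):

hsamp = hierProbit(yh, S)                 # chains [h_i] from the signal trials
fsamp = hierProbit(yf, N)                 # chains [f_i] from the noise trials
dsamp = hsamp - fsamp                     # Equation 40
csamp = -fsamp                            # Equation 41
dconv = qnorm(yh/S) - qnorm(yf/N)         # conventional estimates, for comparison
dest = apply(dsamp, 2, mean)              # posterior means of individual d'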

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown; they range from 1.24 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

    d̂′ = Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: ah = af = bh = bf = .01, uh = uf = 0, and vh = vf = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in; this choice of burn-in period is very conservative. Autocorrelation of sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂0, and the circles represent posterior means of pi from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency; conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension

It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let yij^(h) and yij^(f) denote the response of the ith person to the jth item. Values of yij^(h) and yij^(f) are dichotomous, indicating whether the response was old or new. Let pij^(h) and pij^(f) be the true probabilities that yij^(h) = 1 and yij^(f) = 1, respectively; that is, these probabilities are the true hit and false alarm rates for a given participant-by-item combination. Let hij and fij be the probit transforms [hij = Φ⁻¹(pij^(h)), fij = Φ⁻¹(pij^(f))]. The model is given by

    yij^(h) ~ Bernoulli[Φ(hij)]    (42)

and

    yij^(f) ~ Bernoulli[Φ(fij)].    (43)

Linear models are placed on hij and fij:

    hij = μh + αi + γj    (44)

and

    fij = μf + βi + δj.    (45)

Parameters μh and μf correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

    d′ = μh − μf    (46)

and

    c = −μf.    (47)

Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146         0.880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275

Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′i and ci, are given by

    d′i = μh − μf + αi − βi    (48)

and

    ci = −(μf + βi).    (49)

Item-specific signal detection parameters, denoted d′j and cj, are given by

    d′j = μh − μf + γj − δj    (50)

and

    cj = −(μf + δj).    (51)

Priors on (μh, μf) are independent normals, with mean μ0k and variance σ0k², where k indicates hits or false alarms. In the implementation, we set μ0k = 0 and σ0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

    αi ~ Normal(0, σα²),    (52)

    βi ~ Normal(0, σβ²),    (53)

    γj ~ Normal(0, σγ²),    (54)

and

    δj ~ Normal(0, σδ²).    (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8-11): 20 hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle; item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this


Figure 18. Autocorrelation of overall sensitivity (μh − μf) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′i for 2 participants.


effect is not assumed in model analysis: participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day; with exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation: Chains with larger amounts of autocorrelation call for larger thinning multiples.
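A minimal R illustration of thinning; the autocorrelated chain here is synthetic (an AR(1) process standing in for a sticky Gibbs chain), not output from the model:

M=1e5
chain=as.numeric(arima.sim(list(ar=.95),n=M))  # highly autocorrelated chain
keep=seq(101,M,by=50)       # discard burn-in, keep every 50th sample
thinned=chain[keep]
acf(thinned)                # autocorrelation is greatly attenuated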

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling

Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution.

Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior. Hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
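The Cholesky approach mentioned above can be sketched in a few lines of R. The function name rmvnorm.chol is ours; this is the generic construction, not the authors' posted code:

# sample one draw from a multivariate normal with mean m and covariance V
rmvnorm.chol=function(m,V)
{
R=chol(V)                        # upper triangular R with t(R)%*%R = V
as.vector(m+t(R)%*%rnorm(length(m)))
}
# example call; in the blocked sampler, m and V would be the conditional
# posterior mean and covariance of (mu.h, alpha, gamma)
rmvnorm.chol(rep(0,3),diag(3))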

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic



improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from

a different participant. The y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms. Others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors

The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ Bivariate Normal(0, Σ^(j)),

independently across participants and items.

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
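As a sketch of how such a correlated prior might be simulated in R (the hyperparameter settings are illustrative assumptions; rWishart is in base R, and mvrnorm is in the MASS package):

library(MASS)                           # for mvrnorm
df=3; S0=diag(2)                        # hypothetical hyperparameters
W=rWishart(1,df,solve(S0))[,,1]         # Wishart draw
Sigma.i=solve(W)                        # inverse-Wishart draw for the covariance
ab=mvrnorm(20,mu=c(0,0),Sigma=Sigma.i)  # (alpha_i, beta_i) pairs, 20 participants
cor(ab[,1],ab[,2])                      # induced hit/false-alarm effect correlation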

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more

power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model

In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions

Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions introduce too much subjectivity into analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information. This is information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research.

On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection

We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent; the value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the


psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls

There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests

Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many

experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.

Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.

Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.

Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.

Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.

Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.

Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.

Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.

Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.

Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.

Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.

Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.

Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.

Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.

Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.

Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.

Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.

Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.

Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.

Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.

Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.

Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.

Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.

Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.

Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.

Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.

Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.

Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.

Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.

Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.

Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.

McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.

Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.

Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.

Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.

Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.

Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.

Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.

Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.

Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.

Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.

Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.

Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.

Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.

Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0;  x_j = 0 if w_j < 0.  (A1)

Latent variables w are never known; all that is known is their sign. If trial j was successful, the latent variable was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ^{-1}(p).^{A1} Therefore,

w_j ~ Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 (with σ² = 1 and J observations) yields that the conditional posterior z | w is a normal with parameters

μ′ = (J + 1/σ_0²)^{-1} (J w̄ + μ_0/σ_0²)  (A2)

and

σ²′ = (J + 1/σ_0²)^{-1}.  (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.



NOTES

1. Be(a, b) = Γ(a)Γ(b)/Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0;  x_ij = 0 if w_ij < 0.  (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ^{-1}(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a Normal(b,1) truncated below zero
rtnpos=function(n,b)
{
u=runif(n)
b+qnorm(1-u*pnorm(b))
}
# generates n samples of a Normal(b,1) truncated above zero
rtnneg=function(n,b)
{
u=runif(n)
b+qnorm(u*pnorm(-b))
}
# -----------------------------
# generate true values and data
# -----------------------------
I=20 # Number of participants
J=50 # Number of Trials per participant
# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
# ----------------------------------------
# Analysis Set Up
M=10000 # total number of MCMC iterations
myrange=101:M # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z=matrix(ncol=I,nrow=M) # matrix of each person's z at each iteration
# note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
# prior parameters
sigma0=100 # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a
# -----------------------------
# Main MCMC loop
# -----------------------------
for (m in 2:M) # iteration
{
for (i in 1:I) # participant
{
w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i])) # samples of w|y,z
prec=J+1/sigma2[m-1]
meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
z[m,i]=rnorm(1,meanz,sqrt(1/prec)) # sample of z|w,mu,sigma2
}
prec=I/sigma2[m-1]+1/sigma0
meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
mu[m]=rnorm(1,meanmu,sqrt(1/prec)) # sample of mu|z
rate=sum((z[m,]-mu[m])^2)/2+b
sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate) # sample of sigma2|z
}
# -----------------------------------------
# Gather estimates of p for each individual
allpest=pnorm(z[myrange,]) # MCMC chains of p
pesth=apply(allpest,2,mean) # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



This occurs only if the value of k is a beta function, for example, k = 1/Be(a′, b′). Consequently,

f(p | y) = p^{a′−1} (1 − p)^{b′−1} / Be(a′, b′).  (21)

The posterior, like the prior, is a beta, but with parameters a′ and b′ given by Equation 20.

The derivation above was greatly facilitated by separating those terms that depend on the parameter of interest from those that do not. The latter terms serve as a normalizing constant and can be computed by ensuring that the posterior integrates to 1.0. In most derivations, the value of the normalizing constant becomes apparent, much as it did above. Consequently, it is convenient to consider only those terms that depend on the parameter of interest and to lump the rest into a proportionality constant. Hence, a convenient form of Bayes's theorem is

Posterior distribution ∝ Likelihood function × Prior distribution.

The symbol "∝" denotes proportionality and is read is proportional to. This form of Bayes's theorem is used throughout this article.

Figure 7 shows both the prior and posterior of p for the case of y = 7 successes in N = 10 trials. The prior corresponds to a beta with a = b = 1, and the observed data are 7 successes in 10 trials. The posterior in this case is a beta distribution with parameters a′ = 8 and b′ = 4. As can be seen, the posterior is considerably more narrow than the prior, indicating that a large degree of information has been gained from the data. There are two valuable quantities derived from the posterior distribution: the posterior mean and the 95% highest density region (also termed the 95% credible interval). The former is the point estimate of p, and the latter serves analogously to a confidence interval. These two quantities are also shown in Figure 7.
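These quantities are easy to compute in R. The sketch below uses the equal-tailed interval from qbeta as a stand-in for the highest density region; the two are close for a posterior this symmetric:

a=1; b=1; y=7; N=10
a1=a+y; b1=b+N-y           # posterior is Beta(8,4), per Equation 20
a1/(a1+b1)                 # posterior mean, about .667
qbeta(c(.025,.975),a1,b1)  # central 95% credible interval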

For the case of a beta-distributed posterior, the expression for the posterior mean is

p̂ = a′/(a′ + b′) = (y + a)/(N + a + b).

The posterior mean reduces to p̂_1 if a = b = .5; it reduces to p̂_2 if a = b = 1. Therefore, estimators p̂_1 and p̂_2 are theoretically justified from a Bayesian perspective. The estimator p̂_0 may be obtained for values of a = b = 0. For these values, however, the integral of the prior distribution is infinite. If the integral of a distribution is infinite, the distribution is termed improper; conversely, if the integral is finite, the distribution is termed proper. In Bayesian analysis, it is essential that posterior distributions be proper. However, it is not necessary for the prior to be proper: In some cases, an improper prior will still yield a proper posterior. When this occurs, the analysis is perfectly valid. For estimation of p, the posterior is proper even when a = b = 0.

When the prior and posterior are from the same distributional family, the prior is termed conjugate. The beta distribution is the conjugate prior for binomially distributed data because the resulting posterior is also a beta distribution. Conjugate priors are convenient; they often facilitate the derivation of posterior distributions. Conjugacy, however, is not at all necessary for Bayesian analysis. A discussion about conjugacy is provided in the General Discussion.

BAYESIAN ESTIMATION WITH NORMALLY DISTRIBUTED DATA

In this section, we present Bayesian analysis for normally distributed data. The normal plays a critical role in the hierarchical signal detection model: We will assume that participant and item effects on sensitivity and bias are distributed as normals. The results and techniques developed in this section are used directly in analyzing the subsequent hierarchical signal detection models.

Consider the case in which a sequence of observations w = (w_1, ..., w_N) are independent and identically distributed as normals:

w_j ~ Normal(μ, σ²).

The goal is to estimate parameters μ and σ². In this section, we discuss a more limited goal: the estimation of one of the parameters, assuming that the other is known. In the next section, we will introduce a form of Markov chain Monte Carlo sampling for estimation when both parameters are unknown.

Estimating μ With Known σ²

Assume that σ² is known and that the estimation of μ is of primary interest. The goal, then, is to derive the


Figure 7. Prior and posterior distributions for 7 successes out of 10 trials. Both are distributed as betas. Parameters for the prior are a = 1 and b = 1; parameters for the posterior are a′ = 8 and b′ = 4. The posterior mean and 95% highest density region are also indicated.


posterior f(μ | σ², w). According to the proportional form of Bayes's theorem,

f(μ | σ², w) ∝ f(w | μ, σ²) × f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

μ ~ Normal(μ_0, σ_0²).

Parameters μ_0 and σ_0² must be specified before analysis, according to the researcher's beliefs about μ. By choosing σ_0² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ.

The density of the prior is given by

f(μ) = (2πσ_0²)^{-1/2} exp[−(μ − μ_0)²/(2σ_0²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

f(w | μ, σ²) = (2πσ²)^{-N/2} exp[−Σ_j (w_j − μ)²/(2σ²)].  (22)

Multiplying the equation above by the prior yields

f(μ | σ², w) ∝ (2πσ²)^{-N/2} exp[−Σ_j (w_j − μ)²/(2σ²)] × (2πσ_0²)^{-1/2} exp[−(μ − μ_0)²/(2σ_0²)].

We concern ourselves with only those terms that involve μ, the parameter of interest; other terms are used to normalize the posterior density and are absorbed into the constant of proportionality. Collecting the terms in μ yields

f(μ | σ², w) ∝ exp[−(μ − μ′)²/(2σ²′)],

where

μ′ = σ²′ (Σ_j w_j/σ² + μ_0/σ_0²)  (23)

and

σ²′ = (N/σ² + 1/σ_0²)^{-1}.  (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ²′, respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ²′:

μ | σ², w ~ Normal(μ′, σ²′).  (25)

Unfortunately, the notation does not make it clear that both μ′ and σ²′ depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ²′(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ²′ = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior


is constructed to be fairly informative, with μ_0 = 100 and σ_0² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.
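These posterior quantities follow directly from Equations 23 and 24; a short R check of the numbers above:

w=c(112,106,104,111); sigma2=16        # data and known variance
mu0=100; sigma0.2=100                  # prior mean and variance
s2p=1/(length(w)/sigma2+1/sigma0.2)    # Equation 24, about 3.85
mup=s2p*(sum(w)/sigma2+mu0/sigma0.2)   # Equation 23, about 107.9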

Estimating σ² With Known μ

In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

f(σ²; a, b) = [b^a/Γ(a)] (σ²)^{−(a+1)} exp(−b/σ²),  σ² > 0.

The parameters of the inverse gamma are a and b, and, in application, these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

f(σ² | μ, w) ∝ (σ²)^{-N/2} exp[−Σ_j (w_j − μ)²/(2σ²)] × (σ²)^{−(a+1)} exp(−b/σ²).

Collecting only those terms dependent on σ² yields

f(σ² | μ, w) ∝ (σ²)^{−(a′+1)} exp(−b′/σ²),

where

a′ = N/2 + a  (26)

and

b′ = Σ_j (w_j − μ)²/2 + b.  (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

σ² | μ, w ~ Inverse Gamma(a′, b′).  (28)

Note that posterior parameter b′ depends explicitly on the value of μ.

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem: The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors μ | w and σ² | w. The straightforward


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984); Gibbs sampling is a restricted form of MCMC, the more general form being known as Metropolis–Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis: With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965); it is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]_1 denotes the first sample, [η]_2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]_1 ~ Normal(μ + [η]_1, σ²)

and

[x]_2 ~ Normal(μ + [η]_2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
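In R, this Monte Carlo integration takes a few lines; the parameter values are arbitrary choices for illustration:

mu=400; sigma=50; tau=100
eta=rexp(1e5,rate=1/tau)            # samples of the exponential component
x=rnorm(1e5,mean=mu+eta,sd=sigma)   # conditional sampling yields marginal x
hist(x,breaks=100,freq=FALSE)       # approximates the ex-Gaussian density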

We now return to the problem of deriving the marginal posterior distribution μ | w from the conditional μ | σ², w. If there were a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there were a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]_1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]_1 ~ Normal(μ′([σ²]_1), σ²′([σ²]_1)),

where μ′ and σ²′ are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]_1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]_m ~ Normal(μ′([σ²]_m), σ²′([σ²]_m))  (29)

and

[σ²]_m ~ Inverse Gamma(a′, b′([μ]_{m−1})).  (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]_1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994). The initial region in


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


The initial region in which the samples reflect the starting value is termed the burn-in period. The later samples are from the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples has random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m0 denote the point at which the samples are approximately steady state. Samples after m0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σ_{m>m0} [μ]m / (M − m0)

and

σ̂² = Σ_{m>m0} [σ²]m / (M − m0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m0 to M.

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution. Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.
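In R, for example, this reciprocal relation is a one-line helper (rinvgamma is our own name, not a function from the article):

```r
# If x ~ Gamma(shape = a, rate = b), then 1/x ~ Inverse Gamma(a, b)
rinvgamma <- function(n, a, b) 1 / rgamma(n, shape = a, rate = b)
```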

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ₀ = 0, σ₀² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value of [σ²]1. As can be seen, both chains converge to their steady state rather quickly. For this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.
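A compact R sketch of this Gibbs sampler follows. It assumes the standard conjugate conditional posteriors for a normal mean and variance (the explicit forms of μ′, σ²′, a′, and b′ are given in the article's Equations 23–28); it is an illustration, not the authors' code:

```r
# Gibbs sampler for mu and sigma^2: normal data, normal prior on mu,
# inverse-gamma prior on sigma^2 (standard conjugate conditionals).
w   <- c(112, 106, 104, 111)   # the four IQ replicates
n   <- length(w)
mu0 <- 0;    s20 <- 1e6        # diffuse prior on mu
a   <- 1e-6; b   <- 1e-6       # diffuse prior on sigma^2

M     <- 10000
mu    <- numeric(M)
s2    <- numeric(M)
s2[1] <- 1                     # arbitrary starting value [sigma^2]_1

for (m in 2:M) {
  # conditional posterior of mu given sigma^2 (normal)
  v     <- 1 / (n / s2[m - 1] + 1 / s20)
  cen   <- v * (sum(w) / s2[m - 1] + mu0 / s20)
  mu[m] <- rnorm(1, cen, sqrt(v))
  # conditional posterior of sigma^2 given mu (inverse gamma)
  s2[m] <- 1 / rgamma(1, shape = a + n / 2,
                      rate  = b + sum((w - mu[m])^2) / 2)
}

m0 <- 100                            # burn-in
mean(mu[m0:M])                       # posterior mean of mu
quantile(mu[m0:M], c(.025, .975))    # 95% credible interval
```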

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.


Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]1 = 10,000 and [σ²]1 = .0001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ₀ = 100 and σ₀² = 15² = 225. We may not know the particular test–retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test–retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]1 = 1, burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another. The value of a sample on cycle m + 1 is, in general, dependent on that of cycle m. If this dependence is small, then convergence happens quickly, and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that, for the normal application, there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
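In R, the built-in acf function produces exactly such a plot. Applied to the post-burn-in chain from the sketch above:

```r
# Autocorrelation function of the post-burn-in samples of mu;
# near-zero values at all lags indicate rapid convergence.
acf(mu[m0:M])
```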

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from the t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values via the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z), and conversely, z = Φ⁻¹(p).
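In R, Φ and Φ⁻¹ are the built-in functions pnorm and qnorm, so the transform in Figure 12 is one line in each direction:

```r
pnorm(0.5)   # p = Phi(z): about .69, as in Figure 12
qnorm(0.69)  # z = Phi^-1(p): about 0.5
```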

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p(h)) − Φ⁻¹(p(f))

and

c = −Φ⁻¹(p(f)),

where p(h) and p(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest.

When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ₀, σ₀²). (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case in which a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta, with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) × exp[−(z − μ₀)² / (2σ₀²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]m = Φ([z]m).
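The following R sketch implements the standard Albert–Chib augmentation for this intercept-only probit model: each trial gets a latent normal variable truncated above or below 0 according to its outcome, and z is then updated by a conjugate normal draw. The article's exact conditionals are in its Appendix A; treat this as an illustration of that standard scheme, not the authors' code:

```r
# Gibbs sampler for y ~ Binomial(N, Phi(z)) with z ~ Normal(mu0, s20),
# using latent variables w_i ~ Normal(z, 1) truncated by each trial's outcome.
y <- 7; N <- 10
mu0 <- 0; s20 <- 1
M <- 10000
z <- numeric(M)                              # z[1] = 0 serves as the starting value
succ <- c(rep(TRUE, y), rep(FALSE, N - y))   # trial-level outcomes

for (m in 2:M) {
  # latent w_i: positive on success trials, negative otherwise
  cut <- pnorm(0, z[m - 1], 1)
  u   <- ifelse(succ, runif(N, cut, 1), runif(N, 0, cut))
  w   <- qnorm(u, z[m - 1], 1)
  # conjugate normal update for z given the latent w
  v    <- 1 / (N + 1 / s20)
  z[m] <- rnorm(1, v * (sum(w) + mu0 / s20), sqrt(v))
}

p <- pnorm(z)   # chain of posterior samples of p, [p]_m = Phi([z]_m)
```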

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ₀ = 0 and σ₀² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of 1,000 samples, with 100 samples of burn-in, are more than sufficient for this application.


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.



ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, pi. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂i = yi/Ni, where yi and Ni are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on the transformed probabilities of success. The ith participant's true ability, zi, is simply a sample from this normal.


Figure 13. Three normally distributed priors on z [Normal(0, 1), Normal(1, 1), and Normal(0, 2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The number of successes is modeled as

yi | zi ~ Binomial[Ni, Φ(zi)], (32)

with a parent distribution (first-stage prior) given by

zi | μ, σ² ~ Normal(μ, σ²). (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ₀, σ₀²) (34)

and

σ² ~ Inverse Gamma(a, b). (35)

Values of μ₀, σ₀², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities,2 which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p0 (Equation 12) and p2 (Equation 14) are also shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.
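In outline, each cycle of such a sampler extends the single-z sampler above with conjugate updates for the parent parameters. The fragment below sketches one cycle under the standard conjugate forms; it is not the Appendix C code, and the function name gibbs.cycle is our own:

```r
# One Gibbs cycle for the hierarchical probit model (Equations 32-35).
# y, N: data vectors; z: current transformed abilities; mu, s2: current
# parent parameters; mu0, s20, a, b: second-stage prior constants.
gibbs.cycle <- function(y, N, z, mu, s2, mu0, s20, a, b) {
  I <- length(y)
  for (i in 1:I) {
    # Albert-Chib latent-variable update for z[i], with prior Normal(mu, s2)
    succ <- c(rep(TRUE, y[i]), rep(FALSE, N[i] - y[i]))
    cut  <- pnorm(0, z[i], 1)
    u    <- ifelse(succ, runif(N[i], cut, 1), runif(N[i], 0, cut))
    w    <- qnorm(u, z[i], 1)
    v    <- 1 / (N[i] + 1 / s2)
    z[i] <- rnorm(1, v * (sum(w) + mu / s2), sqrt(v))
  }
  # conjugate updates for the parent distribution (Equations 34 and 35)
  v  <- 1 / (I / s2 + 1 / s20)
  mu <- rnorm(1, v * (sum(z) / s2 + mu0 / s20), sqrt(v))
  s2 <- 1 / rgamma(1, shape = a + I / 2, rate = b + sum((z - mu)^2) / 2)
  list(z = z, mu = mu, s2 = s2)
}
```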

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ₀ = 0, σ₀² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of pi for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of the other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p0. The top right panel shows the hierarchical estimates as a function of p0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p0. For the data of Table 2, the difference in efficiency between p2 and p0 is exceedingly small (about .0001). The reason that there was little gain for p2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p2 and p0 have about the same efficiency. The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄.


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′i and ci denote signal detection parameters for the ith participant:

d′i = Φ⁻¹(pi(h)) − Φ⁻¹(pi(f))

and

ci = −Φ⁻¹(pi(f)),

where pi(h) and pi(f) are individuals' true hit and false alarm probabilities, respectively. Let hi and fi be the probit transforms, hi = Φ⁻¹(pi(h)) and fi = Φ⁻¹(pi(f)). Then sensitivity and criteria are given by

d′i = hi − fi

and

ci = −fi.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of hi and fi results in accurate estimation of d′i and ci.

The model is developed as follows. Let Ni and Si denote the number of noise and signal trials given to the ith participant. Let yi(h) and yi(f) denote the number of hits and false alarms, respectively. The model for yi(h) and yi(f) is given by

yi(h) | hi ~ Binomial[Si, Φ(hi)]

and

yi(f) | fi ~ Binomial[Ni, Φ(fi)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

hi ~ Normal(μh, σh²) (36)

and

fi ~ Normal(μf, σf²). (37)

Second-stage priors are placed on the parent distribution parameters (μk, σk²) as

μk ~ Normal(uk, vk), k = h, f, (38)

and

σk² ~ Inverse Gamma(ak, bk), k = h, f. (39)


Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data   p0     p2     Hierarchical
 1           .587        33     .66    .654   .672
 2           .621        28     .56    .558   .615
 3           .633        34     .68    .673   .683
 4           .650        29     .58    .577   .627
 5           .659        32     .64    .635   .660
 6           .665        37     .74    .731   .716
 7           .667        30     .60    .596   .638
 8           .667        29     .58    .577   .627
 9           .687        34     .68    .673   .684
10           .693        34     .68    .673   .682
11           .696        34     .68    .673   .683
12           .701        37     .74    .731   .715
13           .713        38     .76    .750   .728
14           .734        38     .76    .750   .727
15           .736        37     .74    .731   .717
16           .738        33     .66    .654   .672
17           .743        37     .74    .731   .716
18           .759        33     .66    .654   .672
19           .786        38     .76    .750   .728
20           .803        42     .84    .827   .771
RMSE                            .053   .053   .041

Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from hi and fi for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of hi) and for noise trials (producing estimates of fi). These posterior samples are denoted [hi] and [fi], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′i and ci. This is easily accomplished. For all samples m,

[d′i]m = [hi]m − [fi]m (40)

and

[ci]m = −[fi]m. (41)
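With the chains stored as vectors, this second step is a pair of vectorized one-liners in R (h.chain and f.chain are hypothetical names for the posterior samples of hi and fi):

```r
dprime.chain <- h.chain - f.chain  # Equation 40: [d'_i]_m = [h_i]_m - [f_i]_m
c.chain      <- -f.chain           # Equation 41: [c_i]_m = -[f_i]_m
```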

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.24 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: ah = af = bh = bf = .1, uh = uf = 0, and vh = vf = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information; the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17: The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p0, and the circles represent posterior means of pi from the hierarchical model. (Top right) Hierarchical estimates as a function of p0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.



SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let yij(h) and yij(f) denote the response of the ith person to the jth item. Values of yij(h) and yij(f) are dichotomous, indicating whether the response was old or new. Let pij(h) and pij(f) be the true probabilities that yij(h) = 1 and yij(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let hij and fij be the probit transforms [hij = Φ⁻¹(pij(h)), fij = Φ⁻¹(pij(f))]. The model is given by

yij(h) ~ Bernoulli[Φ(hij)] (42)

and

yij(f) ~ Bernoulli[Φ(fij)]. (43)

Linear models are placed on hij and fij:

hij = μh + αi + γj (44)

and

fij = μf + βi + δj. (45)

Parameters μh and μf correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μh − μf (46)

and

c = −μf. (47)


Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146         0.880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275

Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′i and ci, are given by

d′i = μh − μf + αi − βi (48)

and

ci = −(μf + βi). (49)

Item-specific signal detection parameters, denoted d′j and cj, are given by

d′j = μh − μf + γj − δj (50)

and

cj = −(μf + δj). (51)

Priors on (μh, μf) are independent normals with means μ0k and variances σ0k², where k indicates hits or false alarms. In the implementation, we set μ0k = 0 and σ0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

αi ~ Normal(0, σα²), (52)

βi ~ Normal(0, σβ²), (53)

γj ~ Normal(0, σγ²), (54)

and

δj ~ Normal(0, σδ²). (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .1, b = .1). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in the model analysis. Participant and item effects on hits and false alarms are estimated independently.


Figure 18. Autocorrelation of overall sensitivity (μh − μf) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′i for 2 participants.


Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
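In R, thinning is a single subscripting operation (chain, m0, and M are placeholder names for a stored chain, the burn-in point, and the chain length):

```r
thinned <- chain[seq(m0 + 1, M, by = 50)]  # keep every 50th post-burn-in sample
```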

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μh is determined largely by the sums Σi αi and Σj γj. Likewise, the sample of each αi is determined largely by the value of μh. On iteration m, the value of [μh]m−1 influences each value of [αi]m, and the sum of these [αi]m values has a large influence on [μh]m. By transitivity, the value of [μh]m−1 influences the value of [μh]m; that is, they are correlated. This type of correlation is also true of μf and, subsequently, of the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μh, α, γ) are sampled together from a multivariate distribution. Let θh denote the vector of parameters, θh = (μh, α, γ). The derivation of the conditional posterior distribution follows by multiplying the likelihood by the prior:

f(θh | σα², σγ², y) ∝ f(y | θh, σα², σγ²) × f(θh | σα², σγ²).

Each component of θh has a normal prior. Hence, the prior f(θh | σα², σγ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σα², σγ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μf, β, δ, σβ², σδ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
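A sketch of that Cholesky approach in R (rmvnorm.chol is our own helper name, not from the posted code):

```r
# Sample from Normal(mu, Sigma): if Sigma = L L' and e ~ Normal(0, I),
# then mu + L e has mean mu and covariance Sigma.
rmvnorm.chol <- function(mu, Sigma) {
  L <- t(chol(Sigma))  # chol() returns the upper triangle; transpose to get L
  as.vector(mu + L %*% rnorm(length(mu)))
}
```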

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18). Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.



The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant. The y-axis value of the point is αi, the participant's hit rate effect; the x-axis value is βi, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γj vs. δj). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects reflected the mirror effect, as before.

True Sensitivity Conventional Estimates

Iteration0 5 10 20 30

Aut

ocor

rela

tion

Est

imat

es

Hie

rarc

hica

l Est

imat

es

Lag

0 10 20 30

Sen

sitiv

ity

0 400 800

0 10 20 30

20

15

10

05

0

30

20

10

0

30

20

10

0

8

4

0

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γj = δj). Overall sensitivity results of the simulation are shown in Figure 21. These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.



FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(αi, βi)′ ~ Bivariate Normal(0, Σ(i))

and

(γj, δj)′ ~ Bivariate Normal(0, Σ(j)).

The covariance matrices Σ(i) and Σ(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
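For reference, base R can already draw Wishart matrices, and inverting such a draw yields a covariance matrix sample. The degrees of freedom and scale below are illustrative values, not choices from the article:

```r
S     <- rWishart(1, df = 3, Sigma = diag(2))[, , 1]  # one 2 x 2 Wishart draw
Omega <- solve(S)  # its inverse: a draw of a 2 x 2 covariance matrix
```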

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be an increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of the old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of the hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into the analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior. Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of the data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent. The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003).


When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior. When this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate the true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools, such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools, such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.


Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.

NOTES

1. Be(a, b) = Γ(a)Γ(b) / Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0;  x_j = 0 if w_j < 0.   (A1)

Latent variables w are never known. All that is known is their sign: If a trial j was successful, w_j was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (Note A1). Therefore,

w_j ~ind Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ₀ and σ₀², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + σ₀⁻²)⁻¹ (N w̄ + μ₀/σ₀²)   (A2)

and

σ²′ = (N + σ₀⁻²)⁻¹,   (A3)

where N is the number of trials and w̄ is the mean of w.

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with variance of 1.0; the term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.




APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0;  x_ij = 0 if w_ij < 0.   (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ₀ and σ² for σ₀²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.



APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# -------------------------------------------
# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a Normal(b,1) truncated below zero
rtnpos=function(n,b) {u=runif(n); b+qnorm(1-u*pnorm(b))}
# generates n samples of a Normal(b,1) truncated above zero
rtnneg=function(n,b) {u=runif(n); b+qnorm(u*pnorm(-b))}

# -----------------------------
# generate true values and data
# -----------------------------
I=20  # number of participants
J=50  # number of trials per participant

# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251

# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# analysis setup
# ---------------
M=10000        # total number of MCMC iterations
myrange=101:M  # portion kept; burn-in of 100 iterations discarded

# needed vectors and matrices
z=matrix(ncol=I,nrow=M)  # each person's z at each iteration;
                         # note z is the probit of p, i.e., z=qnorm(p)
sigma2=1:M
mu=1:M

# prior parameters
sigma0=100  # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1

# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)

# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a

# --------------
# main MCMC loop
# --------------
for (m in 2:M)    # iteration
{
  for (i in 1:I)  # participant
  {
    w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i]))  # samples of w|y,z
    prec=J+1/sigma2[m-1]
    meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
    z[m,i]=rnorm(1,meanz,sqrt(1/prec))  # sample of z|w,mu,sigma2
  }
  prec=I/sigma2[m-1]+1/sigma0
  meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
  mu[m]=rnorm(1,meanmu,sqrt(1/prec))  # sample of mu|z
  rate=sum((z[m,]-mu[m])^2)/2+b
  sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate)  # sample of sigma2|mu,z
}

# ------------------------------------------
# gather estimates of p for each individual
# ------------------------------------------
allpest=pnorm(z[myrange,])   # MCMC chains of p
pesth=apply(allpest,2,mean)  # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



terior f(μ | σ², w). According to the proportional form of Bayes's theorem,

f(μ | σ², w) ∝ f(w | μ, σ²) f(μ | σ²).

All terms are conditioned on σ² to reflect the fact that it is known.

The first step is to choose a prior distribution for μ. The normal is a suitable choice:

μ ~ Normal(μ₀, σ₀²).

Parameters μ₀ and σ₀² must be specified before analysis, according to the researcher's beliefs about μ. By choosing σ₀² to be increasingly large, the researcher can make the prior arbitrarily variable. In the limit, the prior approaches having equal density across all values. This prior is considered noninformative for μ.

The density of the prior is given by

f(μ) = (2πσ₀²)^(−1/2) exp[−(μ − μ₀)² / (2σ₀²)].

Note that the prior on μ does not depend on σ²; therefore, f(μ) = f(μ | σ²).

The next step is multiplying the likelihood f(w | μ, σ²) by the prior density f(μ | σ²). For normally distributed data, the likelihood is given by

f(w | μ, σ²) = ∏_j (2πσ²)^(−1/2) exp[−(w_j − μ)² / (2σ²)].   (22)

Multiplying the equation above by the prior yields

f(μ | σ², w) ∝ exp{−[Σ_j (w_j − μ)²/σ² + (μ − μ₀)²/σ₀²] / 2}.

We concern ourselves with only those terms that involve μ, the parameter of interest. Other terms are used to normalize the posterior density and are absorbed into the constant of proportionality:

f(μ | σ², w) ∝ exp{−[μ²(N/σ² + 1/σ₀²) − 2μ(Σ_j w_j/σ² + μ₀/σ₀²)] / 2} ∝ exp[−(μ − μ′)² / (2σ²′)],

where

σ²′ = (N/σ² + 1/σ₀²)⁻¹   (23)

and

μ′ = σ²′ (Σ_j w_j/σ² + μ₀/σ₀²).   (24)

The posterior is proportional to the density of a normal with mean and variance given by μ′ and σ²′, respectively. Because the conditional posterior distribution must integrate to 1.0, this distribution must be that of a normal with mean μ′ and variance σ²′:

μ | σ², w ~ Normal(μ′, σ²′).   (25)

Unfortunately, the notation does not make it clear that both μ′ and σ²′ depend on the value of σ². When this dependence is critical, it will be made explicit: μ′(σ²) and σ²′(σ²).

Because the posterior and prior are from the same family, the normal distribution is the conjugate prior for the population mean with normally distributed data (with known variance). In the limit that the prior is diffuse (i.e., as its variance is made increasingly large), the posterior of μ is a normal with μ′ = w̄ and σ²′ = σ²/N. This distribution is also the sampling distribution of the sample mean in classical statistics. Hence, for the diffuse prior, the Bayesian and conventional approaches yield the same results.

Figure 8 shows the estimation for an example in which the data are w = (112, 106, 104, 111) and the variance is assumed known to be 16. The mean of these four observations is 108.25. For demonstration purposes, the prior


is constructed to be fairly informative, with μ₀ = 100 and σ₀² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean, but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.
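Because Equations 23-25 are in closed form, the numbers above are easy to verify; a minimal R check, using the values from this example:

w=c(112,106,104,111)              # data; variance known to be 16
sigma2=16; mu0=100; sigma02=100   # known variance and prior parameters
v=1/(length(w)/sigma2+1/sigma02)  # posterior variance (Equation 23): about 3.85
m=v*(sum(w)/sigma2+mu0/sigma02)   # posterior mean (Equation 24): about 107.9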

Estimating σ² With Known μ

In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: If a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

f(σ²; a, b) = [bᵃ / Γ(a)] (σ²)^−(a+1) exp(−b/σ²),  σ² ≥ 0.

The parameters of the inverse gamma are a and b, and in application these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

f(σ² | μ, w) ∝ {∏_j (2πσ²)^(−1/2) exp[−(w_j − μ)²/(2σ²)]} × (σ²)^−(a+1) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

f(σ² | μ, w) ∝ (σ²)^−(a′+1) exp(−b′/σ²),

where

a′ = N/2 + a   (26)

and

b′ = [Σ_j (w_j − μ)²] / 2 + b.   (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

σ² | μ, w ~ Inverse Gamma(a′, b′).   (28)

Note that posterior parameter b′ depends explicitly on the value of μ.
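Equation 28 is straightforward to sample in R, because an inverse-gamma draw is the reciprocal of a gamma draw; a minimal sketch with hypothetical values of a′ and b′:

aprime=3; bprime=100                          # hypothetical posterior parameters
s2=1/rgamma(100000,shape=aprime,rate=bprime)  # draws from Inverse Gamma(a',b')
mean(s2)                                      # about b'/(a'-1) = 50, the analytic mean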

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem: The use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors, μ | w and σ² | w.

Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


The straightforward method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984); Gibbs sampling is a restricted form of MCMC, and the more general form is known as Metropolis-Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis: With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965); it is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]₁ denotes the first sample, [η]₂ denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]₁ ~ Normal(μ + [η]₁, σ²)

and

[x]₂ ~ Normal(μ + [η]₂, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
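In R, this Monte Carlo integration takes only a few lines; the parameter values here are arbitrary choices for illustration:

n=100000; mu=400; sigma=50; tau=100  # illustrative ex-Gaussian parameters
eta=rexp(n,rate=1/tau)               # [eta]: samples from Exponential(tau)
x=rnorm(n,mean=mu+eta,sd=sigma)      # [x]: samples of x|eta; marginally ex-Gaussian
hist(x,prob=TRUE,breaks=100)         # histogram approximates the ex-Gaussian density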

We now return to the problem of deriving the marginal posterior μ | w from the conditional μ | σ², w. If there were a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there were a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]₁ is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]₁ ~ Normal(μ′([σ²]₁), σ²′([σ²]₁)),

where μ′ and σ²′ are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]₁ is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]_m ~ Normal(μ′([σ²]_{m−1}), σ²′([σ²]_{m−1}))   (29)

and

[σ²]_m ~ Inverse Gamma(a′, b′([μ]_m)).   (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]₁ and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994).
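The following R fragment is a minimal sketch of this two-step Gibbs sampler, assuming the IQ data and the diffuse prior settings described below (w = (112, 106, 104, 111), μ₀ = 0, σ₀² = 10⁶, a = b = 10⁻⁶):

w=c(112,106,104,111); N=length(w)   # data from the IQ example
mu0=0; sigma02=1e6; a=1e-6; b=1e-6  # diffuse prior parameters
M=10000; mu=double(M); sigma2=double(M)
sigma2[1]=1                         # arbitrary starting value [sigma2]_1
for (m in 2:M) {
  v=1/(N/sigma2[m-1]+1/sigma02)     # sigma2'([sigma2]_{m-1}), Equation 23
  mu[m]=rnorm(1,v*(sum(w)/sigma2[m-1]+mu0/sigma02),sqrt(v))    # Equation 29
  sigma2[m]=1/rgamma(1,shape=N/2+a,rate=sum((w-mu[m])^2)/2+b)  # Equation 30
}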

Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


The initial region in which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m₀ denote the point at which the samples are approximately steady state. Samples after m₀ can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σ_{m>m₀} [μ]_m / (M − m₀)

and

σ̂² = Σ_{m>m₀} [σ²]_m / (M − m₀),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m₀ to M.
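In R, these summaries are computed directly on the stored chains; a sketch, assuming the chains mu and sigma2 from the sampler above and m₀ = 100:

m0=100; keep=(m0+1):M            # discard the burn-in samples
mean(mu[keep])                   # posterior mean of mu
quantile(mu[keep],c(.025,.975))  # 95% credible interval for mu
mean(sigma2[keep])               # posterior mean of sigma^2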

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution; Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (s_w̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ₀ = 0, σ₀² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value for [σ²]₁. As can be seen, both chains converge to their steady state rather quickly; for this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.

The diffuse priors used in the previous case representthe situation in which researchers have no previous knowl-

Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]₁ = 10,000 and [σ²]₁ = 0.001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ₀ = 100 and σ₀² = 15² = 225. We may not know the particular test-retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test-retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]₁ = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another: The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that for the normal application, there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid: Good approximations may be obtained with short burn-ins and relatively short chains.

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in; it differs from the t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution prior is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z) and, conversely, z = Φ⁻¹(p).
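In R, Φ and Φ⁻¹ are the built-in functions pnorm and qnorm, so the transform is one line each way; for example, the values drawn in Figure 12:

pnorm(0.5)   # p = Phi(z) at z = 0.5; approximately .69
qnorm(0.69)  # z = Phi^-1(p) at p = .69; approximately 0.5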

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest.

When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ₀, σ₀²).   (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case when a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior f(z | Y) for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) × exp[−(z − μ₀)² / (2σ₀²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined, upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ₀ = 0 and σ₀² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]₁, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.
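A minimal R sketch of the Albert-Chib sampler for this example (y = 7, N = 10, standard normal prior on z), using the truncated-normal samplers rtnpos and rtnneg defined in Appendix C:

y=7; N=10; mu0=0; sigma02=1     # data and prior from the Figure 14 example
M=1000; z=double(M); z[1]=0     # chain with an arbitrary starting value
for (m in 2:M) {
  w=c(rtnpos(y,z[m-1]),rtnneg(N-y,z[m-1]))      # latent w | z, x (Appendix A)
  v=1/(N+1/sigma02)                             # Equation A3
  z[m]=rnorm(1,v*(sum(w)+mu0/sigma02),sqrt(v))  # z | w (Equation A2)
}
p=pnorm(z[101:M])               # posterior samples of p after burn-in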

Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.



ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i/N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior: A normally distributed parent may be placed on transformed probabilities of success.

Figure 13. Three normally distributed priors on z [N(0, 1), N(1, 1), and N(0, 2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The ith participant's true ability, z_i, is simply a sample from this normal. The number of successes is modeled as

y_i | z_i ~ind Binomial[N_i, Φ(z_i)],   (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ind Normal(μ, σ²).   (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ₀, σ₀²)   (34)

and

σ² ~ Inverse Gamma(a, b).   (35)

Values of μ₀, σ₀², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities,² which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p₀ (Equation 12) and p₂ (Equation 14) are shown.

to demonstrate the hierarchical model Table 2 shows hy-pothetical data for a set of 20 participants performing 50trials These success frequencies were generated from abinomial but each individual had a unique true proba-bility of success These true probabilities2 which rangedbetween 59 and 80 are shown in the second column ofTable 2 Parameter estimators p0 (Equation 12) and p2(Equation 14) are shown

All of the theoretical results needed to derive the con-ditional posterior distributions have already been pre-

sented The Albert and Chib (1995) analysis for this ap-plication is provided in Appendix B A coded version ofMCMC for this model is presented in Appendix C

The MCMC chain was run for 10000 iterations Pri-ors were set to be fairly diffuse (μ0 0 σ 0

2 1000 a b 1) All parameters converged to steady-state behav-ior quickly (burn-in was conservative at 100 iterations)Autocorrelation of samples of pi for 2 select participantsis shown in the bottom panels of Figure 16 The auto-correlation in these plots was typical of other param-eters Autocorrelation for this model was no worse thanthat for the nonhierarchical Bayesian model of the pre-ceding section The top left panel of Figure 16 showsposterior means of the probability of success as a func-tion of the true probability of success for the hierarchi-cal model (filled circles) Also included are classical estimates from p0 (filled circles) Points from the hier-archical model are on average closer to the diagonalAs is indicated in Table 2 the hierarchical model haslower root mean squared error than does its frequentistcounterpart p0 The top right panel shows the hierarchi-cal estimates as a function of p0 The effect of the hier-archical structure is clear The extreme estimates in thefrequentist approach are moderated in the Bayesiananalysis

Estimator p2 also moderates extreme values The ten-dency is to pull them toward the middle specifically es-timates are closer to 5 than they are with p0 For the dataof Table 2 the difference in efficiency between p2 andp0 is exceedingly small (about 0001) The reason thatthere was little gain for p2 is that it pulled estimates to-ward a value of 5 which is not the mean of the trueprobabilities The mean was p 69 and for this valueand these sample sizes estimators p2 and p0 have aboutthe same efficiency The hierarchical estimator is supe-

μ μ σ~ Normal 02( )

zi | μ σ μ σ2 ind~ Normal 2( )

y z N zi i i i| ind~ Binomial Φ( )⎡⎣ ⎤⎦

Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.



A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which the signal trials and the noise trials are, respectively, identical replicates.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant, and let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ind Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ind Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ind Normal(μ_h, σ_h²)   (36)

and

f_i ~ind Normal(μ_f, σ_f²).   (37)

Second-stage priors are placed on parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k),  k = h, f,   (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k),  k = h, f.   (39)

Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data   p₀     p₂     Hierarchical
 1           .587        33     .66    .654   .672
 2           .621        28     .56    .558   .615
 3           .633        34     .68    .673   .683
 4           .650        29     .58    .577   .627
 5           .659        32     .64    .635   .660
 6           .665        37     .74    .731   .716
 7           .667        30     .60    .596   .638
 8           .667        29     .58    .577   .627
 9           .687        34     .68    .673   .684
10           .693        34     .68    .673   .682
11           .696        34     .68    .673   .683
12           .701        37     .74    .731   .715
13           .713        38     .76    .750   .728
14           .734        38     .76    .750   .727
15           .736        37     .74    .731   .717
16           .738        33     .66    .654   .672
17           .743        37     .74    .731   .716
18           .759        33     .66    .654   .672
19           .786        38     .76    .750   .728
20           .803        42     .84    .827   .771
RMSE                            .053   .053   .041

Note: Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section: Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished: For all samples,

[d′_i]_m = [h_i]_m − [f_i]_m   (40)

and

[c_i]_m = −[f_i]_m.   (41)

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = .1, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in; this choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17: The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.
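Once post-burn-in chains [h_i] and [f_i] are in hand (each obtained exactly as in the preceding section), Equations 40 and 41 are elementwise arithmetic on the chains. A sketch for one participant, with placeholder chains standing in for real posterior samples:

h=rnorm(9000,1.0,.15)   # placeholder for a real post-burn-in chain [h_i]
f=rnorm(9000,-.5,.15)   # placeholder for a real post-burn-in chain [f_i]
dprime=h-f              # Equation 40, applied samplewise
crit=-f                 # Equation 41
mean(dprime); quantile(dprime,c(.025,.975))  # posterior mean and 95% interval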

Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p₀, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p₀. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.



SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then, we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension

It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was "old" or "new." Let p_ij^(h) and p_ij^(f) be true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) ~ind Bernoulli[Φ(h_ij)]   (42)

and

y_ij^(f) ~ind Bernoulli[Φ(f_ij)].   (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j   (44)

and

f_ij = μ_f + β_i + δ_j.   (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f   (46)

and

c = −μ_f.   (47)
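A generating sketch of Equations 42-45 in R; the parameter values are hypothetical, and no mirror-effect structure is imposed here:

I=20; J=50                                # participants and items
muh=1; muf=-1                             # hypothetical grand means
alpha=rnorm(I,0,.3); gamma=rnorm(J,0,.3)  # participant and item effects on hits
beta=rnorm(I,0,.3); delta=rnorm(J,0,.3)   # participant and item effects on false alarms
h=muh+outer(alpha,gamma,"+")              # h_ij = mu_h + alpha_i + gamma_j (Eq. 44)
f=muf+outer(beta,delta,"+")               # f_ij = mu_f + beta_i + delta_j (Eq. 45)
yh=matrix(rbinom(I*J,1,pnorm(h)),I,J)     # dichotomous hits (Eq. 42)
yf=matrix(rbinom(I*J,1,pnorm(f)),I,J)     # dichotomous false alarms (Eq. 43)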

Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146         0.880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275

Note: Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i   (48)

and

c_i = −(μ_f + β_i).   (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j   (50)

and

c_j = −(μ_f + δ_j).   (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²),   (52)

β_i ~ind Normal(0, σ_β²),   (53)

γ_j ~ind Normal(0, σ_γ²),   (54)

and

δ_j ~ind Normal(0, σ_δ²).   (55)

Second-stage priors on all four variances are inverse-gamma distributions; in application, the parameters of the inverse gamma were set to (a = .1, b = .1). The model was evaluated with the same simulation from the introduction (Equations 8-11): Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle; item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in the model analysis: Participant and item effects on hits and false alarms are estimated independently.

Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.



Figure 18 shows plots of the samples of overall sensi-tivity (d prime ) The chain has a much larger degree of auto-correlation than was previously observed and this auto-correlation greatly complicates analysis If a short chainis used the resulting sample will not reflect the true pos-terior Two strategies can mitigate autocorrelation Thefirst is to run long chains and thin them For this appli-cation the Gibbs sampling is sufficiently quick that thechain can be run for millions of cycles in the course of aday With exceptionally long chains convergence oc-curs In such cases researchers do not use all iterationsbut instead skip a fixed numbermdashfor example discard-ing all but every 50th observation The correlation be-tween sample m and sample m 50 is greatly attenuatedThe choice of the multiple 50 in the example above is notcompletely arbitrary because it should reflect the amountof autocorrelation Chains with larger amounts of autocor-relation should implicate larger multiples for thinning

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise: for each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution.

Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ)′. The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior. Hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
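As an illustration of the Cholesky approach, the following R sketch (our own illustration; the authors' posted code may differ) draws a sample from a multivariate normal with an arbitrary covariance matrix using only univariate standard normals:

#draw one sample from a multivariate normal with mean m and covariance S
rmvnorm1=function(m,S)
{
L=t(chol(S)) #chol() returns the upper-triangular factor; transpose to lower
m+L%*%rnorm(length(m)) #mean plus lower-triangular factor times iid normals
}
#example: a draw from a bivariate normal with correlation .5
S=matrix(c(1,.5,.5,1),2,2)
theta=rmvnorm1(c(0,0),S)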

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant: the y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected; participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.

Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a

Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.

Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ind Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ind Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit: they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.
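To make the suggestion concrete, one possible parameterization (a sketch of ours, not a model developed in this article) introduces a free standard deviation σ_o for the old-item familiarity distribution and scales the probit transform of the hit rate:

Φ⁻¹(p^(h)) = (μ_h + α_i + γ_j)/σ_o,

with the false alarm equation left unchanged. Setting σ_o = 1 recovers the equal-variance model, so σ_o could, in principle, be estimated jointly with the other parameters.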

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to the analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: the posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants: the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times; the impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: the value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the choice of a prior and the checking of the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: we adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0, and x_j = 0 if w_j < 0. (A1)

Latent variables w are never known; all that is known is their sign. If trial j was successful, w_j was nonnegative; otherwise, it was negative. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p).^A1 Therefore,

w_j ~ind Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ_0²)⁻¹ (N w̄ + μ_0/σ_0²) (A2)

and

σ′² = (N + 1/σ_0²)⁻¹. (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative; hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.



NOTES

1. Be(a, b) = Γ(a)Γ(b) / Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0, and x_ij = 0 if w_ij < 0. (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w_j. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

#functions to sample from a truncated normal
#-------------------------------------------
#generates n samples of a normal(b,1) truncated below at zero
rtnpos=function(n,b)
{
u=runif(n)
b+qnorm(1-u*pnorm(b))
}
#generates n samples of a normal(b,1) truncated above at zero
rtnneg=function(n,b)
{
u=runif(n)
b+qnorm(u*pnorm(-b))
}
#-----------------------------
#generate true values and data
#-----------------------------
I=20 #number of participants
J=50 #number of trials per participant
#generate each individual's true p
p=rbeta(I,3.5,1.5)
#result, use scan() to enter:
#0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
#0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
#0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
#generate each individual's observed numbers of successes
y=rbinom(I,J,p)
#result, use scan() to enter:
#34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
#----------------------------------------
#Analysis, set up
M=10000 #total number of MCMC iterations
myrange=101:M #portion kept; burn-in of 100 iterations discarded
#needed vectors and matrices
z=matrix(ncol=I,nrow=M) #matrix of each person's z at each iteration
#note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
#prior parameters
sigma0=100 #this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
#initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
#shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a
#-----------------------------
#Main MCMC loop
#-----------------------------
for (m in 2:M) #iteration
{
for (i in 1:I) #participant
{
w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i])) #samples of w|y,z
prec=J+1/sigma2[m-1]
meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
z[m,i]=rnorm(1,meanz,sqrt(1/prec)) #sample of z|w,mu,sigma2
}
prec=I/sigma2[m-1]+1/sigma0
meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
mu[m]=rnorm(1,meanmu,sqrt(1/prec)) #sample of mu|z
rate=sum((z[m,]-mu[m])^2)/2+b
sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate) #sample of sigma2|z
}
#-----------------------------------------
#Gather estimates of p for each individual
allpest=pnorm(z[myrange,]) #MCMC chains of p
pesth=apply(allpest,2,mean) #posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



is constructed to be fairly informative, with μ_0 = 100 and σ_0² = 100. The posterior is narrower than the prior, reflecting the information gained from the data. It is centered at μ′ = 107.9 with variance 3.85. Note that the posterior mean is not the sample mean but is modestly influenced by the prior. Whether the Bayesian or the conventional estimate is better depends on the accuracy of the prior. For cases in which much is known about the dependent measure, it may be advantageous to reflect this information in the prior.

Estimating σ² With Known μ
In the last section, the posterior for μ was derived for known σ². In this section, it is assumed that μ is known and that the estimation of σ² is of primary interest. The goal, then, is to estimate the posterior f(σ² | μ, w). The inverse-gamma distribution is the conjugate prior for σ² with normally distributed data. The inverse-gamma distribution is not widely used in psychology. As its name implies, it is related to the gamma distribution: if a random variable g is distributed as a gamma, then 1/g is distributed as an inverse gamma. An inverse-gamma prior on σ² has density

f(σ²; a, b) = [b^a / Γ(a)] (σ²)^−(a+1) exp(−b/σ²), σ² ≥ 0.

The parameters of the inverse gamma are a and b, and in application, these are chosen beforehand. As a and b approach 0, the inverse gamma approaches 1/σ², which is considered the appropriate noninformative prior for σ² (Jeffreys, 1961).

Multiplying the inverse-gamma prior by the likelihood given in Equation 22 yields

f(σ² | μ, w) ∝ (2πσ²)^−N/2 exp[−Σ_j (w_j − μ)² / (2σ²)] × [b^a / Γ(a)] (σ²)^−(a+1) exp(−b/σ²).

Collecting only those terms dependent on σ² yields

f(σ² | μ, w) ∝ (σ²)^−(a′+1) exp(−b′/σ²),

where

a′ = N/2 + a (26)

and

b′ = [Σ_j (w_j − μ)²]/2 + b. (27)

The posterior is proportional to the density of an inverse gamma with parameters a′ and b′. Because this posterior integrates to 1.0, it must also be an inverse gamma:

σ² | μ, w ~ Inverse Gamma(a′, b′). (28)

Note that posterior parameter b′ depends explicitly on the value of μ.

MARKOV CHAIN MONTE CARLO SAMPLING

The preceding development highlights a problem: the use of conjugate priors allowed for the derivation of posteriors only when some of the parameters were assumed to be known. The posteriors in Equations 25 and 28 are known as conditional posteriors because they depend on other parameters. The quantities of primary interest are the marginal posteriors μ | w and σ² | w.


Figure 8. Prior and posterior distributions for estimating the mean of normally distributed data with known variance. The prior and posterior are the dashed and solid distributions, respectively. The vertical line shows the sample mean. The prior "pulls" the posterior slightly downward in this case.


The straightforward method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984); Gibbs sampling is a restricted form of MCMC, and the more general form is known as Metropolis–Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis: with a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965): it is the sum of normal and exponential random variables. The parameters are μ, σ, and τ, the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]_1 denotes the first sample, [η]_2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]_1 ~ Normal(μ + [η]_1, σ²)

and

[x]_2 ~ Normal(μ + [η]_2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: a histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.
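The following R sketch implements this Monte Carlo integration (the parameter values are ours, chosen only for illustration):

#generate ex-Gaussian samples by conditional (Monte Carlo) sampling
M=10000
mu=400
sigma=50
tau=100
eta=rexp(M,rate=1/tau) #samples [eta] from the exponential
x=rnorm(M,mean=mu+eta,sd=sigma) #conditional draws; x is marginally ex-Gaussian
hist(x,freq=FALSE) #histogram approximates the ex-Gaussian density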

We now return to the problem of deriving the marginal posterior distribution μ | w from the conditional μ | σ², w. If there were a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there were a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]_1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]_1 ~ Normal(μ′([σ²]_1), σ′²([σ²]_1)),

where μ′ and σ′² are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]_1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]_m ~ Normal(μ′([σ²]_{m−1}), σ′²([σ²]_{m−1})) (29)

and

[σ²]_m ~ Inverse Gamma(a′, b′([μ]_m)). (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]_1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994). The initial region in


Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.


which the samples reflect the starting value is termed the burn-in period; the later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m_0 denote the point at which the samples are approximately steady state. Samples after m_0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σ_{m > m_0} [μ]_m / (M − m_0)

and

σ̂² = Σ_{m > m_0} [σ²]_m / (M − m_0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m_0 to M.
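In R, for example, these summaries are one-liners on a stored chain (a sketch under our variable names; the placeholder chain merely stands in for Gibbs output):

M=10000
m0=100
mu=rnorm(M,108,2) #placeholder chain; in practice, samples from the Gibbs sampler
good=mu[(m0+1):M] #steady-state portion of the chain
postmean=mean(good) #estimate of the posterior mean
ci=quantile(good,c(.025,.975)) #centermost 95% credible interval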

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution; Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.
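In R, for instance, the reciprocal trick is one line (the same approach is used in the Appendix C code):

#n samples from an inverse gamma with parameters a and b
rinvgamma=function(n,a,b) 1/rgamma(n,shape=a,rate=b)
s2=rinvgamma(1000,a=2,b=3)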

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (s_w̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ_0 = 0, σ_0² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value for [σ²]_1. As can be seen, both chains converge to their steady state rather quickly; for this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.


Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]_1 = 10,000 and [σ²]_1 = .0001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ_0 = 100 and σ_0² = 15² = 225. We may not know the particular test–retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11; it has much of its density below 100, reflecting our belief that the test–retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]_1 = 1, burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: people, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows

that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another: the value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly, and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that, for the normal application, there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
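In R, the autocorrelation function of a stored chain is plotted with the built-in acf function; for example, with the chain mu and the post-burn-in index range myrange defined as in the Appendix C code:

acf(mu[myrange]) #near-zero autocorrelation at all lags suggests rapid convergence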

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from the t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution prior is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z), and conversely, z = Φ⁻¹(p).

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ0, σ0²).  (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case when a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior f(z | Y) for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N − Y) × exp[−(z − μ0)² / (2σ0²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]m = Φ([z]m).
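In R this transformation is a single vectorized call; z.chain is an assumed name for the chain [z]:

  p.chain <- pnorm(z.chain)  # posterior samples of p from samples of z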

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ0 = 0 and σ0² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with parameters a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period.


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.


Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.
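The following R sketch implements the sampler just described for the y = 7, N = 10 example, using the conditionals given in Appendix A (a minimal sketch; the function name albert.chib and starting value are ours, and the truncated normal draws use the inverse-CDF method):

  # Gibbs sampler for a probit-transformed probability of success
  albert.chib <- function(y, N, mu0 = 0, sig0 = 1, M = 10000) {
    x <- c(rep(1, y), rep(0, N - y))   # trial-level successes and failures
    z <- numeric(M)                    # chain for z; starts at 0
    for (m in 2:M) {
      lo <- pnorm(0, z[m - 1], 1)      # mass below the truncation point
      u  <- runif(N)
      # latent w_j: normal(z, 1) truncated below at 0 (x = 1) or above at 0 (x = 0)
      w <- ifelse(x == 1,
                  qnorm(lo + u * (1 - lo), z[m - 1], 1),
                  qnorm(u * lo, z[m - 1], 1))
      # conditional for z is normal (Equations A2 and A3)
      v <- 1 / (N + 1 / sig0^2)
      z[m] <- rnorm(1, v * (sum(w) + mu0 / sig0^2), sqrt(v))
    }
    z
  }
  p.post <- pnorm(albert.chib(7, 10)[-(1:100)])  # posterior samples of p

The histogram of p.post should approximate the Beta(8, 4) density shown in Figure 14.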

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, pi. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂i = yi/Ni, where yi and Ni are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success.

Figure 13. Three normally distributed priors on z [N(0, 1), N(1, 1), and N(0, 2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The ith participant's true ability, zi, is simply a sample from this normal. The number of successes is modeled as

yi | zi ~ Binomial[Ni, Φ(zi)],  (32)

with a parent distribution (first-stage prior) given by

zi | μ, σ² ~ Normal(μ, σ²).  (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ0, σ0²)  (34)

and

σ² ~ Inverse Gamma(a, b).  (35)
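Drawing from these second-stage priors is straightforward in R, because an inverse-gamma variate is the reciprocal of a gamma variate; the values below match the diffuse settings used in the analysis that follows:

  mu     <- rnorm(1, mean = 0, sd = sqrt(1000))  # mu ~ Normal(0, 1000)
  sigma2 <- 1 / rgamma(1, shape = 1, rate = 1)   # sigma^2 ~ Inverse Gamma(1, 1)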

Values of μ0, σ0², a, and b must be specified before analysis. We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities,2 which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂0 (Equation 12) and p̂2 (Equation 14) are also shown.
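The Table 2 simulation can be sketched in a few lines of R. The true probabilities are drawn from the Beta(35, 15) parent given in note 2, and the form (y + 1)/(N + 2) for p̂2 is an assumption on our part, though it reproduces the tabled p̂2 values; the seed is hypothetical:

  set.seed(1)                      # hypothetical seed; Table 2 used its own draws
  N <- 50
  p.true <- rbeta(20, 35, 15)      # one true probability per participant (note 2)
  y  <- rbinom(20, N, p.true)      # successes in 50 trials
  p0 <- y / N                      # conventional estimator (Equation 12)
  p2 <- (y + 1) / (N + 2)          # moderated estimator (assumed Equation 14 form)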

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ0 = 0, σ0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of pi for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂0. The top right panel shows the hierarchical estimates as a function of p̂0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂0. For the data of Table 2, the difference in efficiency between p̂2 and p̂0 is exceedingly small (about .0001). The reason that there was little gain for p̂2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂2 and p̂0 have about the same efficiency.


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value, p̄. Of course, the advantage of the hierarchical model is a function of sample size: With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are, respectively, identical replicates.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′i and ci denote signal detection parameters for the ith participant:

d′i = Φ⁻¹(pi(h)) − Φ⁻¹(pi(f))

and

ci = −Φ⁻¹(pi(f)),

where pi(h) and pi(f) are individuals' true hit and false alarm probabilities, respectively. Let hi and fi be the probit transforms: hi = Φ⁻¹(pi(h)) and fi = Φ⁻¹(pi(f)). Then sensitivity and criteria are given by

d′i = hi − fi

and

ci = −fi.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of hi and fi results in accurate estimation of d′i and ci.

The model is developed as follows. Let Ni and Si denote the number of noise and signal trials given to the ith participant. Let yi(h) and yi(f) denote the number of hits and false alarms, respectively. The model for yi(h) and yi(f) is given by

yi(h) | hi ~ Binomial[Si, Φ(hi)]

and

yi(f) | fi ~ Binomial[Ni, Φ(fi)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

hi ~ Normal(μh, σh²)  (36)

and

fi ~ Normal(μf, σf²).  (37)

Second-stage priors are placed on parent distribution parameters (μk, σk²) as

μk ~ Normal(uk, vk),  k = h, f,  (38)

and

σk² ~ Inverse Gamma(ak, bk),  k = h, f.  (39)


Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p̂0    p̂2    Hierarchical
 1           .587        33    .66   .654  .672
 2           .621        28    .56   .558  .615
 3           .633        34    .68   .673  .683
 4           .650        29    .58   .577  .627
 5           .659        32    .64   .635  .660
 6           .665        37    .74   .731  .716
 7           .667        30    .60   .596  .638
 8           .667        29    .58   .577  .627
 9           .687        34    .68   .673  .684
10           .693        34    .68   .673  .682
11           .696        34    .68   .673  .683
12           .701        37    .74   .731  .715
13           .713        38    .76   .750  .728
14           .734        38    .76   .750  .727
15           .736        37    .74   .731  .717
16           .738        33    .66   .654  .672
17           .743        37    .74   .731  .716
18           .759        33    .66   .654  .672
19           .786        38    .76   .750  .728
20           .803        42    .84   .827  .771
RMSE                           .053  .053  .041
Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from hi and fi for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of hi) and for noise trials (producing estimates of fi). These posterior samples are denoted [hi] and [fi], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′i and ci. This is easily accomplished. For all samples,

[d′i]m = [hi]m − [fi]m  (40)

and

[ci]m = −[fi]m.  (41)
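In R, Equations 40 and 41 are elementwise operations on the chains; h.chain and f.chain are assumed names for the posterior samples [hi] and [fi] of one participant:

  dprime.chain <- h.chain - f.chain  # Equation 40
  c.chain      <- -f.chain           # Equation 41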

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(Hit rate) − Φ⁻¹(False alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: ah = af = bh = bf = .01, uh = uf = 0, and vh = vf = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.
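The conventional estimates in Table 3 follow directly from the formula above; hits and fas are assumed names for vectors of hit and false alarm frequencies out of 50 signal and 50 noise trials:

  d.conv <- qnorm(hits / 50) - qnorm(fas / 50)  # conventional d' per participant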

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17.


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂0, and the circles represent posterior means of pi from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. We show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let yij(h) and yij(f) denote the response of the ith person to the jth item. Values of yij(h) and yij(f) are dichotomous, indicating whether the response was old or new. Let pij(h) and pij(f) be true probabilities that yij(h) = 1 and yij(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let hij and fij be the probit transforms [hij = Φ⁻¹(pij(h)), fij = Φ⁻¹(pij(f))]. The model is given by

yij(h) | hij ~ Bernoulli[Φ(hij)]  (42)

and

yij(f) | fij ~ Bernoulli[Φ(fij)].  (43)

Linear models are placed on hij and fij:

hij = μh + αi + γj  (44)

and

fij = μf + βi + δj.  (45)

Parameters μh and μf correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μh − μf  (46)

and

c = −μf.  (47)


Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146         0.880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                      .183          .275
Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′i and ci, are given by

d′i = μh − μf + αi − βi  (48)

and

ci = −(μf + βi).  (49)

Item-specific signal detection parameters, denoted d′j and cj, are given by

d′j = μh − μf + γj − δj  (50)

and

cj = −(μf + δj).  (51)

Priors on (μh, μf) are independent normals with mean μ0k and variance σ0k², where k indexes hits or false alarms. In the implementation, we set μ0k = 0 and σ0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

αi ~ Normal(0, σα²),  (52)

βi ~ Normal(0, σβ²),  (53)

γj ~ Normal(0, σγ²),  (54)

and

δj ~ Normal(0, σδ²).  (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities.
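A hedged R sketch of this style of data generation follows; it mimics the mirror-effect description above (the same effect raises hits and lowers false alarms), but the grand means and effect standard deviations are illustrative assumptions, not the values of Equations 8–11:

  I <- 20; J <- 50                                 # participants and items
  mu.h <- 1; mu.f <- -1                            # illustrative grand means
  alpha <- rnorm(I, 0, .3); gamma <- rnorm(J, 0, .3)  # sensitivity effects
  h <- mu.h + outer(alpha, gamma, "+")             # effects raise hit probits
  f <- mu.f - outer(alpha, gamma, "+")             # same effects lower FA probits
  y.h <- matrix(rbinom(I * J, 1, pnorm(h)), I, J)  # old-item responses (Eq. 42)
  y.f <- matrix(rbinom(I * J, 1, pnorm(f)), I, J)  # new-item responses (Eq. 43)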


Figure 18. Autocorrelation of overall sensitivity (μh − μf) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′i for 2 participants.


Although the data were generated with a mirror effect, this effect is not assumed in model analysis. Participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
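Thinning is a one-line operation in R; chain is an assumed name for the raw post-burn-in samples:

  thinned <- chain[seq(1, length(chain), by = 50)]  # keep every 50th sample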

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μh is determined largely by the sums Σi αi and Σj γj. Likewise, the sample of each αi is determined largely by the value of μh. On iteration m, the value of [μh]m−1 influences each value of [αi]m, and the sum of these [αi]m values has a large influence on [μh]m. By transitivity, the value of [μh]m−1 influences the value of [μh]m; that is, they are correlated. This type of correlation also holds for μf and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μh, α, γ) are sampled together from a multivariate distribution.

Let θh denote the vector of parameters: θh = (μh, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θh | y, σα², σγ²) ∝ f(y | θh, σα², σγ²) × f(θh | σα², σγ²).

Each component of θh has a normal prior; hence, the prior f(θh | σα², σγ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σα², σγ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μf, β, δ, σβ², σδ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
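The Cholesky approach mentioned above is a short R sketch: if Σ = R′R with R = chol(Σ), then μ + R′z has covariance Σ when z is a vector of independent standard normals (the function name rmvnorm1 is ours):

  rmvnorm1 <- function(mu, Sigma) {
    as.vector(mu + t(chol(Sigma)) %*% rnorm(length(mu)))
  }
  # Example: one draw from a bivariate normal with correlation .5
  rmvnorm1(c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))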

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18). Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased.



All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant: The y-axis value of the point is αi, the participant's hit rate effect; the x-axis value is βi, the participant's false alarm rate effect. The negative correlation is expected; participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γj vs. δj). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation.


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms. Others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γj = δj). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect.


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(αi, βi)′ ~ Bivariate Normal(0, Σ(i))

and

(γj, δj)′ ~ Bivariate Normal(0, Σ(j)).

The covariance matrices Σ(i) and Σ(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
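A hedged R sketch of this idea follows: a Wishart draw yields a random matrix, from which correlated (αi, βi) pairs can be generated. Whether the Wishart sits on the covariance itself or on its inverse (the precision) is a modeling choice not pinned down here; the sketch places it on the precision, and the degrees of freedom and scale are illustrative assumptions:

  W <- rWishart(1, df = 3, Sigma = diag(2))[, , 1]  # Wishart draw (precision)
  Sigma.i <- solve(W)                               # implied covariance matrix
  R <- chol(Sigma.i)                                # Cholesky factor
  effects <- t(R) %*% matrix(rnorm(2 * 20), 2, 20)  # 20 correlated effect pairs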

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0.


A second example is from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior. Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors, and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information. This is information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent. The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003).


When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior. When this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools, such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools, such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. L., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let xj be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable wj is defined as follows:

xj = 1 if wj ≥ 0, and xj = 0 if wj < 0. (A1)

Latent variables w are never known. All that is known is their sign: If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable wj is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs be z, where z = Φ⁻¹(p) (see Note A1). Therefore,

wj ~ind Normal(z, 1).

This construction of wj is shown in Figure A1a. The parameters of interest are z and w = (w1, . . . , wJ); the data are (x1, . . . , xJ). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and wj | z, xj, j = 1, . . . , J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ0 and σ0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (Nw̄ + μ0/σ0²) / (N + 1/σ0²) (A2)

and

σ′² = (N + 1/σ0²)⁻¹, (A3)

where N is the number of observations (here, trials) and w̄ is their mean.

The conditional wj | z, xj is only slightly more complex. If xj is unknown, latent variable wj is normally distributed and centered at z. When wj is conditioned on xj, the sign of wj is known. If xj = 0, then, by Equation A1, wj must be negative. Hence, the conditional wj | z, xj is a normal density that is truncated above at 0 (Figure A1b). Likewise, if xj = 1, then wj must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.


NOTES

1. Be(a, b) = Γ(a)Γ(b) / Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let xij be the indicator for the ith person on the jth trial: xij = 1 if the trial is successful and 0 otherwise. Latent variable wij is constructed as

xij = 1 if wij ≥ 0, and xij = 0 if wij < 0. (B1)

Conditional posterior wij | zi, xij is a truncated normal, as was discussed in Appendix A. Conditional posterior zi | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ0 and σ² for σ0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.

APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable wj for p = .84. (b) Conditional distribution of wj on a failure on trial j; in this case, wj must be negative. (c) Conditional distribution of wj on a success on trial j; in this case, wj must be positive.

NOTE TO APPENDIX A

A1. Let μw be the mean of w. Then, by construction,

Pr(wj < 0) = Pr(xj = 0) = 1 − p.

Random variable wj is a normal with variance of 1.0. The term Pr(wj < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(wj < 0) = Φ(0 − μw) = Φ(−μw).

Combining this equality with the one above yields

Φ(−μw) = 1 − p,

and, by symmetry of the normal,

μw = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-PLUS. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b, 1) truncated below zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a normal(b, 1) truncated above zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# generate true values and data
# -----------------------------
I = 20   # number of participants
J = 50   # number of trials per participant
# generate each individual's true p
p = rbeta(I, 3.5, 1.5)
# result, use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result, use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ----------------------------------------
# analysis setup
M = 10000          # total number of MCMC iterations
myrange = 101:M    # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z = matrix(ncol = I, nrow = M)   # matrix of each person's z at each iteration
# note z is the probit of p, e.g., z = qnorm(p)
sigma2 = 1:M
mu = 1:M
# prior parameters
sigma0 = 100   # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1
# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)
# shape for inverse gamma, calculated outside of loop for speed
shape = I/2 + a

# -----------------------------
# main MCMC loop
# -----------------------------
for (m in 2:M) {      # iteration
  for (i in 1:I) {    # participant
    # samples of w | y, z
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))
    prec = J + 1/sigma2[m-1]
    meanz = (sum(w) + mu[m-1]/sigma2[m-1])/prec
    z[m, i] = rnorm(1, meanz, sqrt(1/prec))   # sample of z | w, mu, sigma2
  }
  prec = I/sigma2[m-1] + 1/sigma0
  meanmu = (sum(z[m, ])/sigma2[m-1] + mu0/sigma0)/prec
  mu[m] = rnorm(1, meanmu, sqrt(1/prec))      # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2)/2 + b
  sigma2[m] = 1/rgamma(1, shape = shape, scale = 1/rate)   # sample of sigma2 | z
}

# -----------------------------------------
# gather estimates of p for each individual
allpest = pnorm(z[myrange, ])     # MCMC chains of p
pesth = apply(allpest, 2, mean)   # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)


A straightforward method of obtaining marginals is to integrate conditionals:

f(μ | w) = ∫ f(μ | σ², w) f(σ² | w) dσ².

In this case, the integral is tractable, but in many cases it is not. When an integral is symbolically intractable, however, it can often still be evaluated numerically. One of the recent developments in numeric integration is Markov chain Monte Carlo (MCMC) sampling. MCMC sampling has become quite popular in the last decade and is described in detail in Gelfand and Smith (1990). We show here how MCMC can be used to estimate the marginal posterior distributions of μ and σ² without assuming known values. We use a particular type of MCMC sampling known as Gibbs sampling (Geman & Geman, 1984). Gibbs sampling is a restricted form of MCMC; the more general form is known as Metropolis-Hastings MCMC.

The goal is to find the marginal posterior distributions. MCMC provides a set of random samples from the marginal posterior distributions (rather than a closed-form derivation of the posterior density or cumulative distribution function). Obtaining a set of random samples from the marginal posteriors is sufficient for analysis. With a sufficiently large sample from a random variable, one can compute its density, mean, quantiles, and any other statistic to arbitrary precision.

To explain Gibbs sampling, we will digress momentarily, using a different example. Consider the case of generating samples from an ex-Gaussian random variable. The ex-Gaussian is a well-known distribution in cognitive psychology (Hohle, 1965). It is the sum of normal and exponential random variables. The parameters are μ, σ, and τ: the location and scale of the normal and the scale of the exponential, respectively. An ex-Gaussian random variable x can be expressed in conditional form:

x | η ~ Normal(μ + η, σ²)

and

η ~ Exponential(τ).

We can take advantage of this conditional form to generate ex-Gaussian samples. First, we generate a set of samples from η. Samples will be denoted with square brackets: [η]1 denotes the first sample, [η]2 denotes the second, and so on. These samples can then be used in the conditional form to generate samples from the marginal random variable x. For example, the first two samples are given by

[x]1 ~ Normal(μ + [η]1, σ²)

and

[x]2 ~ Normal(μ + [η]2, σ²).

In this manner, a whole set of samples from the marginal x can be generated from the conditional random variable x | η. Figure 9 shows that the method works: A histogram of [x] generated in this manner converges to the true ex-Gaussian density. This is an example of Monte Carlo integration of a conditional density.

Figure 9. An example of Monte Carlo integration of a normal conditional density to yield ex-Gaussian distributed samples. The histogram presents the samples from a normal whose mean parameter is distributed as an exponential. The line is the density of the appropriate ex-Gaussian distribution.
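The following R sketch illustrates this Monte Carlo integration (R is the language used in Appendix C); the parameter values μ = 400, σ = 50, and τ = 100 are illustrative assumptions, not values from the text:

# a minimal sketch of the Monte Carlo integration described above;
# the parameter values are illustrative assumptions
mu = 400; sigma = 50; tau = 100         # location, scale, exponential scale
eta = rexp(10000, rate = 1/tau)         # samples [eta] ~ Exponential(tau)
x = rnorm(10000, mean = mu + eta, sd = sigma)   # [x] | [eta] ~ Normal(mu + [eta], sigma^2)
hist(x, breaks = 50, freq = FALSE)      # histogram approximates the ex-Gaussian density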

We now return to the problem of deriving the marginal posterior distribution of μ | w from the conditional μ | σ², w. If there were a set of samples from σ² | w, Monte Carlo integration could be used to sample μ | w. Likewise, if there were a set of samples of μ | w, Monte Carlo integration could be used to sample σ² | w. In Gibbs sampling, these relations are used iteratively, as follows. First, a value of [σ²]1 is picked arbitrarily. This value is then used to sample the posterior of μ from the conditional in Equation 25:

[μ]1 ~ Normal(μ′([σ²]1), σ′²([σ²]1)),

where μ′ and σ′² are the explicit functions of σ² given in Equations 23 and 24. Next, the value of [μ]1 is used to sample σ² from the conditional posterior in Equation 28. In general, for iteration m,

[μ]m ~ Normal(μ′([σ²]m−1), σ′²([σ²]m−1)) (29)

and

[σ²]m ~ Inverse Gamma(a′, b′([μ]m)). (30)

In this manner, sequences of random samples [μ] and [σ²] are obtained.

The initial samples of [μ] and [σ²] certainly reflect the choice of starting value [σ²]1 and are not samples from the desired marginal posteriors. It can be shown, however, that under mild technical conditions, the later samples of [μ] and [σ²] are samples from the desired marginal posteriors (Tierney, 1994). The initial region in


which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.
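As a concrete illustration, the following R sketch implements the iteration of Equations 29 and 30 under the conjugate priors described in the text (normal on μ, inverse gamma on σ²). The data anticipate the IQ example below; the diffuse prior settings are assumptions for illustration:

# a minimal sketch of Gibbs sampling for mu and sigma^2 (Equations 29 and 30)
w = c(112, 106, 104, 111)   # data (the IQ example below)
N = length(w)
mu0 = 0; sigma0 = 1e6       # prior on mu: Normal(mu0, sigma0), sigma0 a variance
a = 1e-6; b = 1e-6          # prior on sigma^2: Inverse Gamma(a, b)
M = 10000
mu = sigma2 = numeric(M)
sigma2[1] = 1               # arbitrary starting value
for (m in 2:M) {
  # [mu]_m ~ Normal(mu', sigma'^2), conditional on [sigma^2]_(m-1)
  prec = N/sigma2[m-1] + 1/sigma0
  mu[m] = rnorm(1, (sum(w)/sigma2[m-1] + mu0/sigma0)/prec, sqrt(1/prec))
  # [sigma^2]_m ~ Inverse Gamma(a', b'), conditional on [mu]_m
  sigma2[m] = 1/rgamma(1, shape = N/2 + a, rate = sum((w - mu[m])^2)/2 + b)
}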

Let m0 denote the point at which the samples are approximately steady state. Samples after m0 can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

μ̂ = Σm>m0 [μ]m / (M − m0)

and

σ̂² = Σm>m0 [σ²]m / (M − m0),

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m0 to M.
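Continuing the sketch above, these summaries follow directly from the retained samples (the burn-in point m0 = 100 is an assumption):

m0 = 100                            # assumed burn-in point
keep = (m0 + 1):M                   # post-burn-in samples
mean(mu[keep])                      # estimate of the posterior mean of mu
quantile(mu[keep], c(.025, .975))   # 95% credible interval for mu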

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution. Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (sw̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ0 = 0, σ0² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value of [σ²]1. As can be seen, both chains converge to their steady state rather quickly; for this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.


Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]1 = 10,000 and [σ²]1 = .001, respectively). Convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ0 = 100 and σ0² = 15² = 225. We may not know the particular test-retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test-retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]1 = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People on average score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another. The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly, and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that for the normal application there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.
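In R, these informal diagnostics for the chains sketched above take just two lines:

plot(mu[keep], type = "l")   # trace plot; steady-state chains look like stationary noise
acf(mu[keep])                # autocorrelation function; spikes near zero beyond lag 0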

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution prior is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z), and, conversely, z = Φ⁻¹(p).

Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p(h)) − Φ⁻¹(p(f))

and

c = −Φ⁻¹(p(f)),

where p(h) and p(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ0, σ0²). (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case in which a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).
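The construction behind Figure 13 can be mimicked in two lines of R; a standard normal prior on z transforms to an approximately flat prior on p:

z = rnorm(100000, 0, 1)        # samples from a Normal(0, 1) prior on z
hist(pnorm(z), freq = FALSE)   # implied prior on p = Phi(z); approximately flat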

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) × exp[−(z − μ0)² / (2σ0²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined, upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]m = Φ([z]m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ0 = 0 and σ0² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with density a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.

Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.
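A minimal R sketch of this sampler, reusing the truncated normal functions rtnpos() and rtnneg() from Appendix C, is as follows (the settings y = 7, N = 10, μ0 = 0, and σ0² = 1 are those of this example):

y = 7; N = 10           # data: successes and trials
mu0 = 0; sigma0 = 1     # prior on z: Normal(mu0, sigma0), sigma0 a variance
M = 1000
z = numeric(M)          # z[1] = 0 serves as the starting value
for (m in 2:M) {
  # sample latent w | z, x: positive for successes, negative for failures
  w = c(rtnpos(y, z[m-1]), rtnneg(N - y, z[m-1]))
  # sample z | w from a normal (Equations A2 and A3 with sigma^2 = 1)
  prec = N + 1/sigma0
  z[m] = rnorm(1, (sum(w) + mu0/sigma0)/prec, sqrt(1/prec))
}
p = pnorm(z[101:M])     # posterior samples of p; compare with Beta(8, 4)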

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, pi. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂i = yi / Ni, where yi and Ni are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success.

Figure 13. Three normally distributed priors on z (Normal(0, 1), Normal(1, 1), and Normal(0, 2)) and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.

The ith participant's true ability, zi, is simply a sample from this normal. The number of successes is modeled as

yi | zi ~ind Binomial[Ni, Φ(zi)], (32)

with a parent distribution (first-stage prior) given by

zi | μ, σ² ~ind Normal(μ, σ²). (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ0, σ0²) (34)

and

σ² ~ Inverse Gamma(a, b). (35)

Values of μ0, σ0², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities (see Note 2), which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂0 (Equation 12) and p̂2 (Equation 14) are shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ0 = 0, σ0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of pi for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂0. The top right panel shows the hierarchical estimates as a function of p̂0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂0. For the data of Table 2, the difference in efficiency between p̂2 and p̂0 is exceedingly small (about .0001). The reason that there was little gain for p̂2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂2 and p̂0 have about the same efficiency. The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂0, and the circles represent posterior means of pi from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′i and ci denote signal detection parameters for the ith participant:

d′i = Φ⁻¹(pi(h)) − Φ⁻¹(pi(f))

and

ci = −Φ⁻¹(pi(f)),

where pi(h) and pi(f) are individuals' true hit and false alarm probabilities, respectively. Let hi and fi be the probit transforms, hi = Φ⁻¹(pi(h)) and fi = Φ⁻¹(pi(f)). Then, sensitivity and criteria are given by

d′i = hi − fi

and

ci = −fi.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of hi and fi results in accurate estimation of d′i and ci.

The model is developed as follows. Let Ni and Si denote the number of noise and signal trials given to the ith participant. Let yi(h) and yi(f) denote the number of hits and false alarms, respectively. The model for yi(h) and yi(f) is given by

yi(h) | hi ~ind Binomial[Si, Φ(hi)]

and

yi(f) | fi ~ind Binomial[Ni, Φ(fi)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

hi ~ind Normal(μh, σh²) (36)

and

fi ~ind Normal(μf, σf²). (37)

Second-stage priors are placed on the parent distribution parameters (μk, σk²) as

μk ~ Normal(uk, vk), k = h, f, (38)

and

σk² ~ Inverse Gamma(ak, bk), k = h, f. (39)
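To make the model concrete, the following R sketch generates data from Equations 36 and 37; the parent-distribution values below are illustrative assumptions, not the ones behind Table 3:

I = 20; S = 50; N = 50        # participants, signal trials, noise trials
mu.h = 1; sigma.h = sqrt(.1)  # assumed parent of probit-transformed hit rates
mu.f = -.5; sigma.f = sqrt(.1)  # assumed parent of probit-transformed false alarm rates
h = rnorm(I, mu.h, sigma.h)   # Equation 36
f = rnorm(I, mu.f, sigma.f)   # Equation 37
yh = rbinom(I, S, pnorm(h))   # hit frequencies
yf = rbinom(I, N, pnorm(f))   # false alarm frequencies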

Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p̂0    p̂2    Hierarchical
 1           .587        33    .66   .654   .672
 2           .621        28    .56   .558   .615
 3           .633        34    .68   .673   .683
 4           .650        29    .58   .577   .627
 5           .659        32    .64   .635   .660
 6           .665        37    .74   .731   .716
 7           .667        30    .60   .596   .638
 8           .667        29    .58   .577   .627
 9           .687        34    .68   .673   .684
10           .693        34    .68   .673   .682
11           .696        34    .68   .673   .683
12           .701        37    .74   .731   .715
13           .713        38    .76   .750   .728
14           .734        38    .76   .750   .727
15           .736        37    .74   .731   .717
16           .738        33    .66   .654   .672
17           .743        37    .74   .731   .716
18           .759        33    .66   .654   .672
19           .786        38    .76   .750   .728
20           .803        42    .84   .827   .771
RMSE                           .053  .053   .041

Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from hi and fi for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of hi) and for noise trials (producing estimates of fi). These posterior samples are denoted [hi] and [fi], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′i and ci. This is easily accomplished: For all samples m,

[d′i]m = [hi]m − [fi]m (40)

and

[ci]m = −[fi]m. (41)
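In R, if h and f are matrices of posterior samples (iterations by participants) from the two runs, Equations 40 and 41 amount to elementwise arithmetic on the chains (the matrices h and f are assumed available from those runs):

dprime = h - f                        # Equation 40, applied to every sample
crit = -f                             # Equation 41
dprime.est = apply(dprime, 2, mean)   # posterior mean of d' for each participant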

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

d̂′i = Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: ah = af = bh = bf = .01, uh = uf = 0, and vh = vf = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information; the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17: The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′i for 2 participants.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. We show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then, we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension

It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let yij(h) and yij(f) denote the response of the ith person to the jth item. Values of yij(h) and yij(f) are dichotomous, indicating whether the response was old or new. Let pij(h) and pij(f) be true probabilities that yij(h) = 1 and yij(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let hij and fij be the probit transforms [hij = Φ⁻¹(pij(h)), fij = Φ⁻¹(pij(f))]. The model is given by

yij(h) ~ind Bernoulli[Φ(hij)] (42)

and

yij(f) ~ind Bernoulli[Φ(fij)]. (43)

Linear models are placed on hij and fij:

hij = μh + αi + γj (44)

and

fij = μf + βi + δj. (45)

Parameters μh and μf correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μh − μf (46)

and

c = −μf. (47)

Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146         0.880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275

Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′i and ci, are given by

d′i = μh − μf + αi − βi (48)

and

ci = −(μf + βi). (49)

Item-specific signal detection parameters, denoted d′j and cj, are given by

d′j = μh − μf + γj − δj (50)

and

cj = −(μf + δj). (51)

Priors on (μh, μf) are independent normals with mean μ0k and variance σ0k², where k indicates hits or false alarms. In the implementation, we set μ0k = 0 and σ0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

αi ~ind Normal(0, σα²), (52)

βi ~ind Normal(0, σβ²), (53)

γj ~ind Normal(0, σγ²), (54)

and

δj ~ind Normal(0, σδ²). (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8-11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in model analysis. Participant and item effects on hits and false alarms are estimated independently.

Figure 18. Autocorrelation of overall sensitivity (μh − μf) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
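For a chain mu of length M, such as those sketched earlier, thinning amounts to subscripting with a strided sequence:

thinned = mu[seq(101, M, by = 50)]   # keep every 50th post-burn-in sample
acf(thinned)                         # lag-1 dependence is greatly attenuated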

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling

Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μh is determined largely by the sums Σi αi and Σj γj. Likewise, the sample of each αi is determined largely by the value of μh. On iteration m, the value of [μh]m−1 influences each value of [αi]m, and the sum of these [αi]m values has a large influence on [μh]m. By transitivity, the value of [μh]m−1 influences the value of [μh]m; that is, they are correlated. This type of correlation is also true of μf and, subsequently, of the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μh, α, γ) are sampled together from a multivariate distribution. Let θh denote the vector of parameters, θh = (μh, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θh | σα², σγ², y) ∝ f(y | θh, σα², σγ²) × f(θh | σα², σγ²).

Each component of θh has a normal prior; hence, the prior f(θh | σα², σγ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σα², σγ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μf, β, δ, σβ², σδ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
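A minimal sketch of the Cholesky approach just mentioned: a draw from a multivariate normal is built from independent univariate normal draws. The function name rmvn and the example mean and covariance are assumptions for illustration, not part of the posted code:

# sample one draw from Normal(m, Sigma) via the Cholesky factor of Sigma
rmvn = function(m, Sigma) {
  R = chol(Sigma)   # upper triangular, Sigma = t(R) %*% R
  as.vector(m + t(R) %*% rnorm(length(m)))   # t(R) %*% z has covariance Sigma
}
theta = rmvn(c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))   # illustrative usage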

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8-11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18): Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant; the y-axis value of the point is αi, the participant's hit rate effect, and the x-axis value is βi, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γj vs. δj). As was expected, items with higher hit rates have lower false alarm rates.

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.

Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms. Others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γj = δj). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As was expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.

Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors

The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(αi, βi)′ ~ind Bivariate Normal(0, Σ(i))

and

(γj, δj)′ ~ind Bivariate Normal(0, Σ(j)).

The covariance matrices Σ(i) and Σ(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model

In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to the analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is


estimating the population mean from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research; on the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times: The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks; the main one is that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done, unwittingly, from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = \begin{cases} 1, & w_j \ge 0 \\ 0, & w_j < 0. \end{cases}   (A1)

Latent variables w are never known; all that is known is their sign. If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (see Note A1). Therefore,

w_j | z ~ind Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ₀ and σ₀², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

\mu' = \left(\frac{1}{\sigma_0^2} + N\right)^{-1} \left(\frac{\mu_0}{\sigma_0^2} + N\bar{w}\right)   (A2)

and

\sigma'^2 = \left(\frac{1}{\sigma_0^2} + N\right)^{-1}.   (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.
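As a minimal check, the truncated-normal samplers defined in Appendix C can be used to draw the latent w for successful and unsuccessful trials (the value of z and the sample sizes here are illustrative):

rtnpos=function(n,b) {u=runif(n); b+qnorm(1-u*pnorm(b))}  # normal(b,1), w > 0
rtnneg=function(n,b) {u=runif(n); b+qnorm(u*pnorm(-b))}   # normal(b,1), w < 0
w1=rtnpos(1000,.5)  # draws of w | z = .5 for successful trials
w0=rtnneg(1000,.5)  # draws of w | z = .5 for unsuccessful trials
range(w1)           # all positive
range(w0)           # all negative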



NOTES

1. The beta function is given by

\mbox{Be}(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)},

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.
2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful, and 0 otherwise. Latent variable w_ij is constructed as

x_{ij} = \begin{cases} 1, & w_{ij} \ge 0 \\ 0, & w_{ij} < 0. \end{cases}   (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ₀ and σ² for σ₀²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with a variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# -------------------------------------------
# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b,1) truncated below at zero
rtnpos=function(n,b) {u=runif(n); b+qnorm(1-u*pnorm(b))}
# generates n samples of a normal(b,1) truncated above at zero
rtnneg=function(n,b) {u=runif(n); b+qnorm(u*pnorm(-b))}
# -----------------------------
# generate true values and data
# -----------------------------
I=20  # number of participants
J=50  # number of trials per participant
# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result, use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result, use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
# ----------------
# Analysis, set up
# ----------------
M=10000        # total number of MCMC iterations
myrange=101:M  # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z=matrix(ncol=I,nrow=M)  # matrix of each person's z at each iteration
                         # note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
# prior parameters
sigma0=100  # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a
# --------------
# Main MCMC loop
# --------------
for (m in 2:M)    # iteration
{
  for (i in 1:I)  # participant
  {
    w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i]))  # samples of w | y, z
    prec=J+1/sigma2[m-1]
    meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
    z[m,i]=rnorm(1,meanz,sqrt(1/prec))  # sample of z | w, mu, sigma2
  }
  prec=I/sigma2[m-1]+1/sigma0
  meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
  mu[m]=rnorm(1,meanmu,sqrt(1/prec))  # sample of mu | z
  rate=sum((z[m,]-mu[m])^2)/2+b
  sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate)  # sample of sigma2 | z
}
# -----------------------------------------
# Gather estimates of p for each individual
# -----------------------------------------
allpest=pnorm(z[myrange,])   # MCMC chains of p
pesth=apply(allpest,2,mean)  # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



which the samples reflect the starting value is termed the burn-in period. The later samples are the steady-state period. Formally speaking, the samples form irreducible Markov chains with stationary distributions that equal the marginal posterior distribution (see Gilks, Richardson, & Spiegelhalter, 1996). On a more informal level, the first set of samples have random and arbitrary values. At some point, by chance, the random sample happens to be from a high-density region of the true marginal posterior. From this point forward, the samples are from the true posteriors and may be used for estimation.

Let m₀ denote the point at which the samples are approximately steady state. Samples after m₀ can be used to provide both posterior means and credible regions of the posterior (as well as other statistics, such as posterior variance). Posterior means are estimated by the arithmetic means of the samples:

\hat{\mu} = \frac{\sum_{m > m_0} [\mu]_m}{M - m_0}

and

\hat{\sigma}^2 = \frac{\sum_{m > m_0} [\sigma^2]_m}{M - m_0},

where M is the number of iterations. Credible regions are constructed as the centermost interval containing 95% of the samples from m₀ to M.
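In R, these summaries are one-liners; chain below is an illustrative stand-in for stored MCMC output such as [μ]:

m0=100; M=10000
chain=rnorm(M)                 # stand-in for MCMC samples
kept=chain[(m0+1):M]           # discard burn-in
mean(kept)                     # posterior mean
quantile(kept,c(.025,.975))    # centermost 95% credible interval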

To use MCMC to estimate posteriors of μ and σ², it is necessary to sample from a normal and an inverse-gamma distribution. Sampling from a normal is provided in many software packages, and routines for low-level languages such as C or Fortran are readily available. Samples from an inverse-gamma distribution may be obtained by taking the reciprocal of samples from a gamma distribution; Ahrens and Dieter (1974, 1982) provide algorithms for sampling the gamma distribution.

To illustrate Gibbs sampling, consider the case of estimating IQ. The hypothetical data for this example are four replicates from a single participant: w = (112, 106, 104, 111). The classical estimate of μ is the sample mean (108.25), and confidence intervals are constructed by multiplying the standard error (s_w̄ = 1.93) by the appropriate critical t value with three degrees of freedom. From this multiplication, the 95% confidence interval for μ is [102.2, 114.4]. Bayesian estimation was done with Gibbs sampling. Diffuse priors were placed on μ (μ₀ = 0, σ₀² = 10⁶) and σ² (a = 10⁻⁶, b = 10⁻⁶). Two chains of [μ] are shown in the left panel of Figure 10, each corresponding to a different initial value for [σ²]₁. As can be seen, both chains converge to their steady state rather quickly; for this application, there is very little burn-in. The histogram of the samples of [μ] after burn-in is also shown (right panel). The histogram conforms to a t distribution with three degrees of freedom. The posterior mean is 108.26, and the 95% credible interval is [102.2, 114.4]. For normally distributed data, the diffuse priors used in Bayesian analysis yield results that are numerically equivalent to their frequentist counterparts.
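A compact sketch of this Gibbs sampler in R, assuming the same conditional forms as the Appendix C code (a normal conditional for μ and an inverse-gamma conditional for σ²); the prior settings follow the diffuse values above, and the output should approximate the reported posterior mean and interval:

w=c(112,106,104,111); N=length(w)
mu0=0; sigma0=1e6    # prior on mu: Normal(mu0, sigma0)
a=1e-6; b=1e-6       # prior on sigma2: Inverse Gamma(a, b)
M=10000
mu=1:M; sigma2=1:M
sigma2[1]=1; mu[1]=mean(w)
for (m in 2:M) {
  prec=N/sigma2[m-1]+1/sigma0
  meanmu=(sum(w)/sigma2[m-1]+mu0/sigma0)/prec
  mu[m]=rnorm(1,meanmu,sqrt(1/prec))              # mu | sigma2, w
  rate=sum((w-mu[m])^2)/2+b
  sigma2[m]=1/rgamma(1,shape=N/2+a,scale=1/rate)  # sigma2 | mu, w
}
mean(mu[101:M]); quantile(mu[101:M],c(.025,.975))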

The diffuse priors used in the previous case represent the situation in which researchers have no previous knowledge.

Figure 10. Samples of μ from Markov chain Monte Carlo integration for normally distributed data. The left panel shows two chains, each from a poor choice of the initial value of σ² (the solid and dotted lines are for [σ²]₁ = 10,000 and [σ²]₁ = 0.001, respectively); convergence is rapid. The right panel shows the histogram of samples of μ after burn-in. The density is the appropriate frequentist t distribution with three degrees of freedom.


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ₀ = 100 and σ₀² = 15² = 225. We may not know the particular test–retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11. It has much of its density below 100, reflecting our belief that the test–retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling as outlined above; [σ²]₁ = 1; burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: People, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows

that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another: The value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly and relatively short chains are needed. If this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that, for the normal application, there appears to be no autocorrelation; that is, there is no dependence from one sample to the next. This fact implies that convergence is relatively rapid. Good approximations may be obtained with short burn-ins and relatively short chains.

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in. It differs from a t distribution from frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ. There is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution prior is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z), and, conversely, z = Φ⁻¹(p).

Signal detection parameters are defined in terms of probit transforms:

d′ = Φ⁻¹(p^(h)) − Φ⁻¹(p^(f))

and

c = −Φ⁻¹(p^(f)),

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.
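In R, the probit transform is qnorm, and these definitions translate directly; the function name sdt and the counts below are ours, for illustration (35 hits in 50 signal trials, 7 false alarms in 50 noise trials):

sdt=function(hits,S,fa,N) {
  h=qnorm(hits/S)  # probit of hit rate
  f=qnorm(fa/N)    # probit of false alarm rate
  c(dprime=h-f, crit=-f)
}
sdt(hits=35,S=50,fa=7,N=50)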

When using a probit transform, we model the y successes in N trials as

y ~ Binomial[N, Φ(z)],

where z is the parameter of interest. When using a probit, it is convenient to place a normal prior on z:

z ~ Normal(μ₀, σ₀²).   (31)

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case when a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z \mid Y) \propto \Phi(z)^{Y} \left[1 - \Phi(z)\right]^{N-Y} \times \exp\left[-\frac{(z - \mu_0)^2}{2\sigma_0^2}\right].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ₀ = 0 and σ₀² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with density a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]₁, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.
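A quick analytic check of this expectation in R; pchain stands in for the MCMC samples of p, so that the histogram can be compared against the Beta(8, 4) density:

y=7; N=10
pchain=rbeta(10000,y+1,N-y+1)            # stand-in for Gibbs output of p
hist(pchain,prob=TRUE,breaks=40,xlab="p",main="")
curve(dbeta(x,y+1,N-y+1),add=TRUE)       # expected posterior density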

Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = .5.



ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p_i = y_i/N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success. The ith participant's true ability, z_i, is simply a sample from this normal.

Figure 13. Three normally distributed priors on z and the corresponding priors on p. The three columns correspond to priors on z of N(0,1), N(1,1), and N(0,2). Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The number of successes is modeled as

y_i | z_i ~ind Binomial[N_i, Φ(z_i)],   (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ind Normal(μ, σ²).   (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ₀, σ₀²)   (34)

and

σ² ~ Inverse Gamma(a, b).   (35)

Values of μ₀, σ₀², a, and b must be specified before analysis. We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities (see Note 2), which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p0 (Equation 12) and p2 (Equation 14) are shown.
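For reference, the two nonhierarchical estimators can be computed directly. Equations 12 and 14 appear earlier in the article; here we assume p2 is the add-one estimator (y + 1)/(N + 2), which reproduces the tabled values:

y=33; N=50       # participant 1 of Table 2
p0=y/N           # sample proportion: .66
p2=(y+1)/(N+2)   # .654, matching Table 2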

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ₀ = 0, σ₀² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters; autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p0. The top right panel shows the hierarchical estimates as a function of p0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p0. For the data of Table 2, the difference in efficiency between p2 and p0 is exceedingly small (about .0001). The reason that there was little gain for p2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p2 and p0 have about the same efficiency. The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value, p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.



A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ind Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ind Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ind Normal(μ_h, σ_h²)   (36)

and

f_i ~ind Normal(μ_f, σ_f²).   (37)

Second-stage priors are placed on parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k),  k = h, f,   (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k),  k = h, f.   (39)

Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p0    p2    Hierarchical
 1           .587        33    .66   .654  .672
 2           .621        28    .56   .558  .615
 3           .633        34    .68   .673  .683
 4           .650        29    .58   .577  .627
 5           .659        32    .64   .635  .660
 6           .665        37    .74   .731  .716
 7           .667        30    .60   .596  .638
 8           .667        29    .58   .577  .627
 9           .687        34    .68   .673  .684
10           .693        34    .68   .673  .682
11           .696        34    .68   .673  .683
12           .701        37    .74   .731  .715
13           .713        38    .76   .750  .728
14           .734        38    .76   .750  .727
15           .736        37    .74   .731  .717
16           .738        33    .66   .654  .672
17           .743        37    .74   .731  .716
18           .759        33    .66   .654  .672
19           .786        38    .76   .750  .728
20           .803        42    .84   .827  .771
RMSE                           .053  .053  .041
Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished: For all samples,

[d′_i]_m = [h_i]_m − [f_i]_m   (40)

and

[c_i]_m = −[f_i]_m.   (41)
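In R, this second step is a pair of vectorized operations on the stored chains; hchain and fchain are illustrative names for [h_i] and [f_i]:

hchain=rnorm(1000,1,.1)    # stand-in for posterior samples of h_i
fchain=rnorm(1000,-.5,.1)  # stand-in for posterior samples of f_i
dchain=hchain-fchain       # [d'_i]_m = [h_i]_m - [f_i]_m (Equation 40)
cchain=-fchain             # [c_i]_m = -[f_i]_m (Equation 41)
mean(dchain); quantile(dchain,c(.025,.975))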

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = 0.1, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants (Participants 5 and 15) from the hierarchical model. The degree of autocorrelation is typical of the other participants.



SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme, blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then, we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was old or new. Let p_ij^(h) and p_ij^(f) be the true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) | h_ij ~ind Bernoulli[Φ(h_ij)]   (42)

and

y_ij^(f) | f_ij ~ind Bernoulli[Φ(f_ij)].   (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j   (44)

and

f_ij = μ_f + β_i + δ_j.   (45)
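For concreteness, data can be generated under Equations 42–45 as follows; this is a sketch, and the grand means and effect standard deviations are illustrative values, not those of the article's simulation:

I=20; J=50
muh=1; muf=-.5                            # illustrative grand means
alpha=rnorm(I,0,.3); gamma=rnorm(J,0,.3)  # participant, item effects on hits
beta=rnorm(I,0,.3); delta=rnorm(J,0,.3)   # participant, item effects on FAs
h=outer(alpha,gamma,"+")+muh              # h_ij = mu_h + alpha_i + gamma_j
f=outer(beta,delta,"+")+muf               # f_ij = mu_f + beta_i + delta_j
yh=matrix(rbinom(I*J,1,pnorm(h)),I,J)     # hits: Bernoulli[Phi(h_ij)]
yf=matrix(rbinom(I*J,1,pnorm(f)),I,J)     # false alarms: Bernoulli[Phi(f_ij)]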

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f   (46)

and

c = −μ_f.   (47)


Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146          .880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275
Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i   (48)

and

c_i = −(μ_f + β_i).   (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j   (50)

and

c_j = −(μ_f + δ_j).   (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²),   (52)

β_i ~ind Normal(0, σ_β²),   (53)

γ_j ~ind Normal(0, σ_γ²),   (54)

and

δ_j ~ind Normal(0, σ_δ²).   (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = 0.1, b = 0.1). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in model analysis; participant and item effects on hits and false alarms are estimated independently.


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants (Participants 5 and 15).

594 ROUDER AND LU

Although the data were generated with a mirror effect, this effect is not assumed in the model analysis; participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis: if a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day; with exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple, 50 in the example above, is not completely arbitrary, because it should reflect the amount of autocorrelation: chains with larger amounts of autocorrelation call for larger thinning multiples.
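Thinning itself is a one-line operation once the chain is stored. A minimal R sketch (our own illustration; the chain here is a stand-in vector, not output of the actual sampler):

chain <- rnorm(1e6)                       # stand-in for 1,000,000 posterior samples
keep <- seq(101, length(chain), by = 50)  # discard a burn-in of 100, keep every 50th sample
thinned <- chain[keep]
acf(thinned)                              # inspect the remaining autocorrelation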

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise: for each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j; likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. The same type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

$$f(\boldsymbol{\theta}_h \mid \mathbf{y}, \sigma_\alpha^2, \sigma_\gamma^2) \propto f(\mathbf{y} \mid \boldsymbol{\theta}_h, \sigma_\alpha^2, \sigma_\gamma^2) \times f(\boldsymbol{\theta}_h \mid \sigma_\alpha^2, \sigma_\gamma^2).$$

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²); these remain the same as before: they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
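The Cholesky route mentioned above takes only a few lines in R. The following sketch (ours; the mean vector and covariance matrix are hypothetical) draws one sample from a multivariate normal by rescaling independent univariate normals:

mu <- c(0, 0, 0)                        # hypothetical mean vector
Sigma <- matrix(c(1.0, 0.3, 0.1,
                  0.3, 1.0, 0.2,
                  0.1, 0.2, 1.0), 3, 3) # hypothetical covariance matrix
R <- chol(Sigma)                        # upper triangular, t(R) %*% R = Sigma
x <- mu + t(R) %*% rnorm(3)             # one draw from Normal(mu, Sigma)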

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8-11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal, have no detectable overall bias, and are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones; it is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in the analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant; the y-axis value of the point is α_i, the participant's hit rate effect, and the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation.

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.

Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21. These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect.

Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.

Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that it comes about because the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence; we seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

$$\begin{pmatrix} \alpha_i \\ \beta_i \end{pmatrix} \stackrel{\text{ind}}{\sim} \text{Bivariate Normal}\left(\mathbf{0}, \boldsymbol{\Sigma}^{(i)}\right)$$

and

$$\begin{pmatrix} \gamma_j \\ \delta_j \end{pmatrix} \stackrel{\text{ind}}{\sim} \text{Bivariate Normal}\left(\mathbf{0}, \boldsymbol{\Sigma}^{(j)}\right).$$

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified; a suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model, and provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
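For intuition, Wishart draws are available in base R through rWishart. The sketch below is our own illustration, not the Lu (2004) implementation; the degrees of freedom and scale matrix are hypothetical. It draws one random 2 × 2 covariance matrix and then a correlated participant-effect pair:

df <- 5; S0 <- diag(2)              # hypothetical degrees of freedom and scale matrix
W <- rWishart(1, df, S0)[, , 1]     # one Wishart draw, treated here as a precision matrix
Sigma.i <- solve(W)                 # corresponding covariance matrix
ab <- t(chol(Sigma.i)) %*% rnorm(2) # one (alpha_i, beta_i) pair with covariance Sigma.i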

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of the old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model for measuring sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit: they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing it is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter; this may be accomplished by adding a variance parameter to the probit transform of the hit rate. Within the context of such a model, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.
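As a sketch of what such an extension might look like (our own illustration under one standard parameterization of unequal-variance signal detection, not a model fit in this article; all values are hypothetical), the old-item scale enters the probit of the hit rate while the new-item scale stays fixed at 1:

sigma.o <- 1.25                          # hypothetical old-item standard deviation
mu.o <- 1.5; crit <- 0.5                 # hypothetical familiarity mean and criterion
p.hit <- pnorm((mu.o - crit) / sigma.o)  # hit rate under unequal variance
p.fa  <- pnorm(-crit)                    # false alarm rate (new-item scale fixed at 1)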

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to the analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0.


A second example is from the normal distribution: the posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on the variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: first, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions; this, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis-Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that in some applications researchers have a reasonable expectation of the range of the data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research; on the contrary, it may even be less. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on the analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times: the impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent; the value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done, unwittingly, from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when it happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that in many hierarchical situations MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and easily implemented in high-level packages, researchers must approach both the choice of a prior and the checking of the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: we adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. L., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from a normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.

NOTES

1. The beta function is given by

$$\mathrm{Be}(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)},$$

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

$$x_j = \begin{cases} 1, & w_j \ge 0, \\ 0, & w_j < 0. \end{cases} \quad (A1)$$

Latent variables w are never known; all that is known is their sign. If trial j was successful, w_j was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (see Note A1). Therefore,

$$w_j \stackrel{\text{ind}}{\sim} \text{Normal}(z, 1).$$

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, …, w_J); the data are (x_1, …, x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

$$\mu' = \left(N + \frac{1}{\sigma_0^2}\right)^{-1} \left(N\bar{w} + \frac{\mu_0}{\sigma_0^2}\right) \quad (A2)$$

and

$$\sigma'^2 = \left(N + \frac{1}{\sigma_0^2}\right)^{-1}. \quad (A3)$$

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative; hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j given a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j given a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

$$\Pr(w_j < 0) = \Pr(x_j = 0) = 1 - p.$$

Random variable w_j is a normal with a variance of 1.0, and the term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

$$\Pr(w_j < 0) = \Phi(0 - \mu_w) = \Phi(-\mu_w).$$

Combining this equality with the one above yields

$$\Phi(-\mu_w) = 1 - p,$$

and, by symmetry of the normal,

$$\mu_w = \Phi^{-1}(p) = z.$$


APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

$$x_{ij} = \begin{cases} 1, & w_{ij} \ge 0, \\ 0, & w_{ij} < 0. \end{cases} \quad (B1)$$

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.



APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-PLUS. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal (b,1) truncated below zero
rtnpos=function(n,b)
{
u=runif(n)
b+qnorm(1-u*pnorm(b))
}
# generates n samples of a normal (b,1) truncated above zero
rtnneg=function(n,b)
{
u=runif(n)
b+qnorm(u*pnorm(-b))
}
# -----------------------------
# generate true values and data
# -----------------------------
I=20 # number of participants
J=50 # number of trials per participant
# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result; use scan() to enter:
# .6932272 .6956900 .6330869 .6504598 .7008421 .6589233 .7337305
# .6666958 .5867875 .7132572 .7863823 .6869082 .6665865 .7426910
# .6649086 .6212024 .7375715 .8025261 .7587382 .7363251
# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
# ----------------------------------------
# Analysis Set Up
M=10000 # total number of MCMC iterations
myrange=101:M # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z=matrix(ncol=I,nrow=M) # matrix of each person's z at each iteration
# note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
# prior parameters
sigma0=100 # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a
# -----------------------------
# Main MCMC loop
# -----------------------------
for (m in 2:M) # iteration
{
for (i in 1:I) # participant
{
w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i])) # samples of w|y,z
prec=J+1/sigma2[m-1]
meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
z[m,i]=rnorm(1,meanz,sqrt(1/prec)) # sample of z|w,mu,sigma2
}
prec=I/sigma2[m-1]+1/sigma0
meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
mu[m]=rnorm(1,meanmu,sqrt(1/prec)) # sample of mu|z
rate=sum((z[m,]-mu[m])^2)/2+b
sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate) # sample of sigma2|z
}
# -----------------------------------------
# Gather estimates of p for each individual
allpest=pnorm(z[myrange,]) # MCMC chains of p
pesth=apply(allpest,2,mean) # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)


In the IQ example, however, much is known about the distribution of IQ scores. Suppose our test was normed for a mean of 100 and a standard deviation of 15. This knowledge could be profitably used by setting μ_0 = 100 and σ_0² = 15² = 225. We may not know the particular test-retest correlation of our scales, but we can surmise that the standard deviations of replicates should be less than the population standard deviation. After experimenting with the values of a and b, we chose a = 3 and b = 100. The prior on σ² corresponding to these values is shown in Figure 11; it has much of its density below 100, reflecting our belief that the test-retest variability is smaller than the variability in the population. Figure 11 also shows a histogram of the posterior of μ (the posterior was obtained with Gibbs sampling, as outlined above; [σ²]_1 = 1, burn-in of 100 iterations). The posterior mean is 107.95, which is somewhat below the sample mean of 108.25. This difference is from the prior: people, on average, score lower than the obtained scores. Although it is a bit hard to see in the figure, the posterior distribution is more narrow in the extreme tails than is the corresponding t distribution. The 95% credible interval is [102.1, 113.6], which is modestly different from the corresponding frequentist confidence interval. This analysis shows that the use of reasonable prior information can lead to a result different from that of the frequentist analysis. The Bayesian estimate is not only different, but often more efficient than the frequentist counterpart.

The theoretical underpinnings of MCMC guarantee that infinitely long chains will converge to the true posterior. Researchers must decide how long to burn in chains and then how long to continue in order to approximate the posterior well. The most important consideration in this decision is the degree of correlation from one sample to another. In MCMC sampling, consecutive samples are not necessarily independent of one another: the value of a sample on cycle m + 1 is, in general, dependent on that of m. If this dependence is small, then convergence happens quickly, and relatively short chains are needed; if this dependence is large, then convergence is much slower. One informal graphical method of assessing the degree of dependence is to plot the autocorrelation functions of the chains. The bottom right panel of Figure 11 shows that for the normal application there appears to be no autocorrelation, that is, no dependence from one sample to the next. This fact implies that convergence is relatively rapid; good approximations may be obtained with short burn-ins and relatively short chains.
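The informal graphical check just described is a single call in R. A sketch (ours; the chain is a stand-in vector):

chain <- rnorm(10000)  # stand-in for post-burn-in MCMC samples of mu
acf(chain)             # autocorrelation function; spikes beyond lag 0 signal dependence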

Figure 11. Analysis with informative priors. The two top panels show the prior distributions on μ and σ². The bottom left panel shows the histogram of samples of μ after burn-in; it differs from the t distribution of the frequentist analysis in that it is shifted toward 100 and has a smaller upper tail. The bottom right panel shows the autocorrelation function for μ; there is no discernible autocorrelation.


There are a number of formal procedures for assessing convergence in the statistical literature (see, e.g., Gelman & Rubin, 1992; Geweke, 1992; Raftery & Lewis, 1992; see also Gelman et al., 2004).

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detection model is to revisit estimation of a probability from individual trials. Once again, y successes are observed in N trials, and y ~ Binomial(N, p), where p is the parameter of interest. In the previous section, a beta prior was placed on parameter p, and the conjugacy of the beta with the binomial directly led to the posterior. Unfortunately, a beta distribution is not convenient as a prior for the signal detection application.

The signal detection model relies on the probit transform. Figure 12 shows the probit transform, which maps probabilities into z values using the normal inverse cumulative distribution function. Whereas p ranges from 0 to 1, z ranges across the real number line. Let Φ denote the standard normal cumulative distribution function. With this notation, p = Φ(z) and, conversely, z = Φ⁻¹(p).

Signal detection parameters are defined in terms of probit transforms:

$$d' = \Phi^{-1}\left(p^{(h)}\right) - \Phi^{-1}\left(p^{(f)}\right)$$

and

$$c = -\Phi^{-1}\left(p^{(f)}\right),$$

where p^(h) and p^(f) are hit and false alarm probabilities, respectively. As a precursor to models of signal detection, we develop estimation of the probit transform of a probability of success, z = Φ⁻¹(p). The methods for doing so will then be used in the signal detection application.

When using a probit transform, we model the y successes in N trials as

$$y \sim \text{Binomial}[N, \Phi(z)],$$

where z is the parameter of interest.

When using a probit, it is convenient to place a normal prior on z:

$$z \sim \text{Normal}\left(\mu_0, \sigma_0^2\right). \quad (31)$$

Figure 13 shows various normal priors on z and, by transform, the corresponding priors on p. Consider the special case in which a standard normal is placed on z. As is shown in the figure, the corresponding prior on p is flat. A flat prior also corresponds to a beta with a = b = 1 (see Figure 6). From the previous development, if a beta prior is placed on p, the posterior is also a beta, with parameters a′ = a + y and b′ = b + (N − y). Noting for a flat prior that a = b = 1, it is expected that the posterior of p is distributed as a beta with parameters a′ = 1 + y and b′ = 1 + (N − y).
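As a quick check of this conjugate updating, the analytic beta posterior can be plotted directly in R (a sketch; y = 7 and N = 10 anticipate the example discussed shortly):

y <- 7; N <- 10
curve(dbeta(x, 1 + y, 1 + (N - y)), 0, 1,
      xlab = "p", ylab = "Density")  # Beta(a' = 8, b' = 4) posterior under a flat prior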

The next step is to derive the posterior f(z | Y) for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

$$f(z \mid Y) \propto \Phi(z)^{Y}\left[1 - \Phi(z)\right]^{N - Y} \times \exp\left[-\frac{(z - \mu_0)^2}{2\sigma_0^2}\right].$$

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled; these samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application, it may be desirable to estimate the posterior of p as well as that of z. This estimation is easy in MCMC. Because p is the inverse probit transform of z, samples can also be inversely transformed. Let [z] be the chain of samples from the posterior of z. A chain of posterior samples of p is given by [p]_m = Φ([z]_m).
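In R, this inverse transform is vectorized, so an entire chain is converted at once (a sketch; the chain is a stand-in vector):

z.chain <- rnorm(1000, 0.5, 0.2)  # stand-in for posterior samples of z
p.chain <- pnorm(z.chain)         # corresponding posterior samples of p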

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ_0 = 0 and σ_0² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1); consequently, the expected posterior is a beta with density a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]_1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period.


Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = 0.5.


Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example; the methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i/N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials; hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it is evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior: a normally distributed parent may be placed on transformed probabilities of success.

Figure 13. Three normally distributed priors on z [N(0, 1), N(1, 1), and N(0, 2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal; priors on p were obtained by transforming each sample of z.


The ith participant's true ability, z_i, is simply a sample from this normal. The number of successes is modeled as

$$y_i \mid z_i \stackrel{\text{ind}}{\sim} \text{Binomial}\left[N_i, \Phi(z_i)\right], \quad (32)$$

with a parent distribution (first-stage prior) given by

$$z_i \mid \mu, \sigma^2 \stackrel{\text{ind}}{\sim} \text{Normal}\left(\mu, \sigma^2\right). \quad (33)$$

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

$$\mu \sim \text{Normal}\left(\mu_0, \sigma_0^2\right) \quad (34)$$

and

$$\sigma^2 \sim \text{Inverse Gamma}(a, b). \quad (35)$$
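For intuition about this hierarchy, the following R sketch (ours; the hyperparameter values are hypothetical) generates one data set top-down from Equations 33 and 32:

mu <- 1; sigma2 <- 0.25           # hypothetical parent mean and variance
z <- rnorm(20, mu, sqrt(sigma2))  # participants' probit-scale abilities (Equation 33)
y <- rbinom(20, 50, pnorm(z))     # observed success counts (Equation 32)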

Values of μ_0, σ_0², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities (see Note 2), which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂0 (Equation 12) and p̂2 (Equation 14) are also shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B, and a coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ_0 = 0, σ_0² = 1000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16; the autocorrelation in these plots was typical of other parameters and no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (filled circles). Also included are classical estimates from p̂0. Points from the hierarchical model are, on average, closer to the diagonal, and as is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart p̂0. The top right panel shows the hierarchical estimates as a function of p̂0. The effect of the hierarchical structure is clear: the extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂0. For the data of Table 2, the difference in efficiency between p̂2 and p̂0 is exceedingly small (about .0001). The reason that there was little gain for p̂2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p̂2 and p̂0 have about the same efficiency.


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.



A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant, and let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ind Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ind Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ind Normal(μ_h, σ_h²)   (36)

and

f_i ~ind Normal(μ_f, σ_f²).   (37)

Second-stage priors are placed on parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k),  k = h, f,   (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k),  k = h, f.   (39)

Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p_0   p_2   Hierarchical
 1           .587        33    .66   .654  .672
 2           .621        28    .56   .558  .615
 3           .633        34    .68   .673  .683
 4           .650        29    .58   .577  .627
 5           .659        32    .64   .635  .660
 6           .665        37    .74   .731  .716
 7           .667        30    .60   .596  .638
 8           .667        29    .58   .577  .627
 9           .687        34    .68   .673  .684
10           .693        34    .68   .673  .682
11           .696        34    .68   .673  .683
12           .701        37    .74   .731  .715
13           .713        38    .76   .750  .728
14           .734        38    .76   .750  .727
15           .736        37    .74   .731  .717
16           .738        33    .66   .654  .672
17           .743        37    .74   .731  .716
18           .759        33    .66   .654  .672
19           .786        38    .76   .750  .728
20           .803        42    .84   .827  .771
RMSE                           .053  .053  .041
Note: Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished: For all samples m,

[d′_i]_m = [h_i]_m − [f_i]_m   (40)

and

[c_i]_m = −[f_i]_m.   (41)
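In R, this second step reduces to elementwise operations on the stored chains (a minimal sketch; the placeholder vectors stand in for posterior samples obtained as in the previous section):

h.samp = rnorm(9900, 1.5, .15)  # placeholder for posterior samples [h_i]
f.samp = rnorm(9900, -.5, .15)  # placeholder for posterior samples [f_i]
dprime.samp = h.samp - f.samp   # chain for d′_i (Equation 40)
c.samp = -f.samp                # chain for c_i (Equation 41)
mean(dprime.samp)               # posterior mean of sensitivity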

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).
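In R, this conventional estimate is a single line; applied to the first two participants of Table 3, it reproduces the tabled values:

hits = c(35, 38); fas = c(7, 17)  # participants 1 and 2 of Table 3
qnorm(hits/50) - qnorm(fas/50)    # returns 1.605 and 1.119, as in the table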

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = .01, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in; this choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p_0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p_0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.



SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency; conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. We show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension; then, we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was old or new. Let p_ij^(h) and p_ij^(f) be the true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) | h_ij ~ind Bernoulli[Φ(h_ij)]   (42)

and

y_ij^(f) | f_ij ~ind Bernoulli[Φ(f_ij)].   (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j   (44)

and

f_ij = μ_f + β_i + δ_j.   (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f   (46)

and

c = −μ_f.   (47)
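To make the structure of Equations 42 and 44 concrete, the following R sketch generates old-item responses under the linear model (the parameter values and variable names are ours, chosen only for illustration):

I = 20; J = 50                       # participants and items
mu.h = 1                             # hypothetical grand mean on hits
alpha = rnorm(I, 0, .3)              # participant effects on hits
gamma = rnorm(J, 0, .3)              # item effects on hits
h = mu.h + outer(alpha, gamma, "+")  # h[i,j] = mu.h + alpha[i] + gamma[j] (Equation 44)
y.h = matrix(rbinom(I*J, 1, pnorm(h)), I, J)  # Bernoulli old responses (Equation 42)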


Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146          .880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275
Note: Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i   (48)

and

c_i = −(μ_f + β_i).   (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j   (50)

and

c_j = −(μ_f + δ_j).   (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²),   (52)

β_i ~ind Normal(0, σ_β²),   (53)

γ_j ~ind Normal(0, σ_γ²),   (54)

and

δ_j ~ind Normal(0, σ_δ²).   (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8–11): Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle; item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in model analysis: Participant and item effects on hits and false alarms are estimated independently.


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.



Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis: If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day; with exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple (50 in the example above) is not completely arbitrary, because it should reflect the amount of autocorrelation: Chains with larger amounts of autocorrelation call for larger multiples for thinning.
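In R, thinning is a matter of subscripting the stored samples (a minimal sketch; the chain here is only a placeholder):

chain = cumsum(rnorm(1e6)) * .01  # placeholder: a long, autocorrelated chain
burnin = 1000
kept = chain[seq(burnin + 1, length(chain), by = 50)]  # keep every 50th post-burn-in sample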

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise: For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j; likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_(m−1) influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_(m−1) influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution.

Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
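The Cholesky route is compact in R; the sketch below (the function name rmvn is ours, and this is not the authors' posted code) builds one multivariate normal draw from independent univariate standard normals:

# one draw from a multivariate normal with mean m and covariance S
rmvn = function(m, S) {
  L = t(chol(S))  # lower-triangular factor satisfying L %*% t(L) = S
  as.vector(m + L %*% rnorm(length(m)))
}
rmvn(c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))  # example: one correlated bivariate draw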

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of 27% in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias; they are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.




The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant; the y-axis value of the point is α_i, the participant's hit rate effect, and the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.

Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


These are highly similar to the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect, and estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.

Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.



FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ind Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ind Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit: They allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, the variance may then be simultaneously estimated along with overall sensitivity, participant effects, and item effects.
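One way such an extension could be parameterized (our sketch of the standard unequal-variance signal detection form, not a development from this article) is to let σ_o denote the standard deviation of the old-item familiarity distribution, so that the probit of the hit rate becomes h_i = (d′_i − c_i)/σ_o while the false alarm probit is unchanged (f_i = −c_i); sensitivity is then recovered as d′_i = σ_o h_i − f_i, and σ_o may be estimated along with the other parameters.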

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².
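To spell out the binomial example: With a Beta(a, b) prior and y successes in N trials, the posterior is Beta(a + y, b + N − y), whose mean is (a + y)/(a + b + N); setting a = b = 0 yields y/N, exactly the frequentist proportion of successes.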

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants: The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times; the impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.



Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools, such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0;  x_j = 0 if w_j < 0.   (A1)

Latent variables w are never known; all that is known is their sign. If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p).^A1 Therefore,

w_j ~ind Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, …, w_J); the data are (x_1, …, x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ_0²)⁻¹ (N w̄ + μ_0/σ_0²)   (A2)

and

σ′² = (N + 1/σ_0²)⁻¹.   (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative; hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.



NOTES

1. Be(a, b) = Γ(a)Γ(b)/Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0;  x_ij = 0 if w_ij < 0.   (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.

APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b,1) truncated below zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a normal(b,1) truncated above zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# generate true values and data
# -----------------------------
I = 20  # number of participants
J = 50  # number of trials per participant

# generate each individual's true p
p = rbeta(I, 3.5, 1.5)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251

# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# --------------
# analysis setup
# --------------
M = 10000        # total number of MCMC iterations
myrange = 101:M  # portion kept; burn-in of 100 iterations discarded

# needed vectors and matrices
z = matrix(ncol = I, nrow = M)  # each person's z at each iteration;
                                # note z is the probit of p, i.e., z = qnorm(p)
sigma2 = 1:M
mu = 1:M

# prior parameters
sigma0 = 100  # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1

# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)

# shape for the inverse gamma, calculated outside of the loop for speed
shape = I/2 + a

# --------------
# main MCMC loop
# --------------
for (m in 2:M) {      # iteration
  for (i in 1:I) {    # participant
    # samples of w | y, z
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))
    prec = J + 1/sigma2[m-1]
    meanz = (sum(w) + mu[m-1]/sigma2[m-1]) / prec
    z[m, i] = rnorm(1, meanz, sqrt(1/prec))  # sample of z | w, mu, sigma2
  }
  prec = I/sigma2[m-1] + 1/sigma0
  meanmu = (sum(z[m, ])/sigma2[m-1] + mu0/sigma0) / prec
  mu[m] = rnorm(1, meanmu, sqrt(1/prec))  # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2)/2 + b
  sigma2[m] = 1/rgamma(1, shape = shape, scale = 1/rate)  # sample of sigma2 | z
}

# -----------------------------------------
# gather estimates of p for each individual
# -----------------------------------------
allpest = pnorm(z[myrange, ])   # MCMC chains of p
pest = apply(allpest, 2, mean)  # posterior means

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)

Page 15: Rouder & Lu (2005)

HIERARCHICAL BAYESIAN MODELS 587

There are a number of formal procedures for assessingconvergence in the statistical literature (see eg Gel-man amp Rubin 1992 Geweke 1992 Raftery amp Lewis1992 see also Gelman et al 2004)

A PROBIT MODEL FOR ESTIMATING A PROBABILITY OF SUCCESS

The next step toward a hierarchical signal detectionmodel is to revisit estimation of a probability from indi-vidual trials Once again y successes are observed in Ntrials and y ~ Binomial(N p) where p is the parameterof interest In the previous section a beta prior was placedon parameter p and the conjugacy of the beta with the bi-nomial directly led to the posterior Unfortunately a betadistribution prior is not convenient as a prior for the sig-nal detection application

The signal detection model relies on the probit trans-form Figure 12 shows the probit transform which mapsprobabilities into z values using the normal inverse cu-mulative distribution function Whereas p ranges from 0to 1 z ranges across the real number line Let Φ denotethe standard normal cumulative distribution function Withthis notation p Φ(z) and conversely z Φ1( p)

Signal detection parameters are defined in terms ofprobit transforms

and

where p(h) and p(f ) are hit and false alarm probabilities re-spectively As a precursor to models of signal detectionwe develop estimation of the probit transform of a proba-bility of success z Φ1( p) The methods for doing sowill then be used in the signal detection application

When using a probit transform we model the y suc-cesses in N trials as

y ~ Binomial[N Φ(z)]

where z is the parameter of interestWhen using a probit it is convenient to place a normal

prior on z

(31)

Figure 13 shows various normal priors on z and by trans-form the corresponding prior on p Consider the specialcase when a standard normal is placed on z As is shownin the figure the corresponding prior on p is flat A flatprior also corresponds to a beta with a b 1 (see Fig-ure 6) From the previous development if a beta prior isplaced on p the posterior is also a beta with parametersaprime a y and bprime b (N y) Noting for a flat priorthat a b 1 it is expected that the posterior of p is dis-tributed as a beta with parameters aprime 1 y and bprime 1 (N y)

The next step is to derive the posterior, f(z | Y), for a normal prior on z. The straightforward approach is to note that the posterior is proportional to the likelihood multiplied by the prior:

f(z | Y) ∝ Φ(z)^Y [1 − Φ(z)]^(N−Y) exp[−(z − μ0)² / (2σ0²)].

This equation is intractable. Albert and Chib (1995) provide an alternative for analysis, and the details of this alternative are provided in Appendix A. The basic idea is that a new set of latent variables is defined upon which z is conditioned. Conditional posteriors for both z and these new latent variables are easily derived and sampled. These samples can then be used in Gibbs sampling to find a marginal posterior for z.

In this application it may be desirable to estimate theposterior of p as well as that of z This estimation is easyin MCMC Because p is the inverse probit transform ofz samples can also be inversely transformed Let [z] bethe chain of samples from the posterior of z A chain ofposterior samples of p is given by [ p]m Φ( [z]m)
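For example, given a chain of post-burn-in samples of z (such as one column of the matrix produced by the Appendix C code), the corresponding chain for p and its summaries follow in a few lines (a sketch with hypothetical names):

# zchain: post-burn-in samples of z for one participant
pchain = pnorm(zchain)           # chain of posterior samples of p
mean(pchain)                     # posterior mean of p
quantile(pchain, c(.025, .975))  # 95% credible interval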

The top right panel of Figure 14 shows the histogram of the posterior of p for y = 7 successes on N = 10 trials. The prior parameters were μ0 = 0 and σ0² = 1, so the prior on p was flat (see Figure 13). Because the prior is flat, it corresponds to a beta with parameters (a = 1, b = 1). Consequently, the expected posterior is a beta with parameters a′ = y + 1 = 8 and b′ = (N − y) + 1 = 4. The chains in the Gibbs sampler were started from a number of values of [z]1, and convergence was always rapid. The resulting histogram for 10,000 samples is plotted; it approximates the expected distribution. The bottom panel shows a small degree of autocorrelation. This autocorrelation can be overcome by running longer chains and using a more conservative burn-in period. Chains of 1,000 samples with 100 samples of burn-in are more than sufficient for this application.

Figure 12. Probit transform of probability (p) into z scores (z). The transform is the inverse cumulative distribution function of the normal distribution. Lines show the transform of p = .69 into z = .5.

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, pi. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂i = yi/Ni, where yi and Ni are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior: A normally distributed parent may be placed on transformed probabilities of success.

Figure 13. Three normally distributed priors on z [N(0,1), N(1,1), and N(0,2)] and the corresponding priors on p. Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.

The ith participant's true ability, zi, is simply a sample from this normal. The number of successes is modeled as

yi | zi ~ Binomial[Ni, Φ(zi)], (32)

with a parent distribution (first-stage prior) given by

zi | μ, σ² ~ Normal(μ, σ²). (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ0, σ0²) (34)

and

σ² ~ Inverse Gamma(a, b). (35)

Values of μ0, σ0², a, and b must be specified before analysis.
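To make the generative structure concrete, the following sketch (our illustration; the hyperparameter values mirror those in Appendix C) draws one data set from the model by following Equations 32-35 from the top down:

# generative sketch of the hierarchical probit model (Equations 32-35)
# hyperparameter values mirror Appendix C; N is the common trial count
I = 20; N = 50
mu0 = 0; sigma0 = 100; a = 1; b = 1
mu = rnorm(1, mu0, sqrt(sigma0))             # Equation 34
sigma2 = 1 / rgamma(1, shape = a, rate = b)  # Equation 35
z = rnorm(I, mu, sqrt(sigma2))               # Equation 33: parent distribution
y = rbinom(I, N, pnorm(z))                   # Equation 32: observed successes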

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities² (which ranged between .59 and .80) are shown in the second column of Table 2. Parameter estimators p0 (Equation 12) and p2 (Equation 14) are shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ0 = 0, σ0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of pi for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p0. The top right panel shows the hierarchical estimates as a function of p0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p0. For the data of Table 2, the difference in efficiency between p2 and p0 is exceedingly small (about .0001). The reason that there was little gain for p2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p2 and p0 have about the same efficiency. The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a′ = 8 and b′ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′i and ci denote signal detection parameters for the ith participant:

d′i = Φ⁻¹(pi(h)) − Φ⁻¹(pi(f))

and

ci = −Φ⁻¹(pi(f)),

where pi(h) and pi(f) are individuals' true hit and false alarm probabilities, respectively. Let hi and fi be the probit transforms hi = Φ⁻¹(pi(h)) and fi = Φ⁻¹(pi(f)). Then sensitivity and criteria are given by

d′i = hi − fi

and

ci = −fi.

The tactic we follow is to place hierarchical models on h and f rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of hi and fi results in accurate estimation of d′i and ci.

The model is developed as follows. Let Ni and Si denote the number of noise and signal trials given to the ith participant. Let yi(h) and yi(f) denote the number of hits and false alarms, respectively. The model for yi(h) and yi(f) is given by

yi(h) | hi ~ Binomial[Si, Φ(hi)]

and

yi(f) | fi ~ Binomial[Ni, Φ(fi)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

hi ~ Normal(μh, σh²) (36)

and

fi ~ Normal(μf, σf²). (37)

Second-stage priors are placed on parent distribution parameters (μk, σk²) as

μk ~ Normal(uk, vk), k = h, f (38)

and

σk² ~ Inverse Gamma(ak, bk), k = h, f. (39)

Table 2
Data and Estimates of a Probability of Success

Participant   True Prob.   Data   p0     p2     Hierarchical
 1            .587         33     .66    .654   .672
 2            .621         28     .56    .558   .615
 3            .633         34     .68    .673   .683
 4            .650         29     .58    .577   .627
 5            .659         32     .64    .635   .660
 6            .665         37     .74    .731   .716
 7            .667         30     .60    .596   .638
 8            .667         29     .58    .577   .627
 9            .687         34     .68    .673   .684
10            .693         34     .68    .673   .682
11            .696         34     .68    .673   .683
12            .701         37     .74    .731   .715
13            .713         38     .76    .750   .728
14            .734         38     .76    .750   .727
15            .736         37     .74    .731   .717
16            .738         33     .66    .654   .672
17            .743         37     .74    .731   .716
18            .759         33     .66    .654   .672
19            .786         38     .76    .750   .728
20            .803         42     .84    .827   .771
RMSE                              .053   .053   .041
Note. Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from hi and fi for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of hi) and for noise trials (producing estimates of fi). These posterior samples are denoted [hi] and [fi], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′i and ci. This is easily accomplished. For all samples,

[d′i]m = [hi]m − [fi]m (40)

and

[ci]m = −[fi]m. (41)
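A minimal sketch of this second step in R (hchain and fchain are hypothetical matrices of post-burn-in samples, with rows indexing iterations and columns indexing participants):

# combine posterior samples from the two runs (Equations 40 and 41)
dchain = hchain - fchain       # samples of d'_i, applied samplewise
cchain = -fchain               # samples of c_i
dest = apply(dchain, 2, mean)  # posterior mean sensitivity per participant
cest = apply(cchain, 2, mean)  # posterior mean criterion per participant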

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: ah = af = bh = bf = .01, uh = uf = 0, and vh = vf = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.
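In R, the conventional estimate is a one-liner (a sketch; hits and fas are hypothetical vectors of hit and false alarm counts from the 50 signal and 50 noise trials of Table 3):

# conventional (nonhierarchical) sensitivity estimate for each participant
dconv = qnorm(hits / 50) - qnorm(fas / 50)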

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information; the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17: The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p0, and the circles represent posterior means of pi from the hierarchical model. (Top right) Hierarchical estimates as a function of p0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let yij(h) and yij(f) denote the response of the ith person to the jth item. Values of yij(h) and yij(f) are dichotomous, indicating whether the response was old or new. Let pij(h) and pij(f) be true probabilities that yij(h) = 1 and yij(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination, respectively. Let hij and fij be the probit transforms [hij = Φ⁻¹(pij(h)), fij = Φ⁻¹(pij(f))]. The model is given by

yij(h) ~ Bernoulli[Φ(hij)] (42)

and

yij(f) ~ Bernoulli[Φ(fij)]. (43)

Linear models are placed on hij and fij:

hij = μh + αi + γj (44)

and

fij = μf + βi + δj. (45)

Parameters μh and μf correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μh − μf (46)

and

c = −μf. (47)

Table 3
Data and Estimates of Sensitivity

Participant   True d′   Hits   False Alarms   Hierarchical   Conventional
 1            1.238     35      7             1.608          1.605
 2            1.239     38     17             1.307          1.119
 3            1.325     38     11             1.554          1.478
 4            1.330     33     16             1.146          0.880
 5            1.342     40     19             1.324          1.147
 6            1.405     39      8             1.730          1.767
 7            1.424     37     12             1.466          1.350
 8            1.460     27      5             1.417          1.382
 9            1.559     35      9             1.507          1.440
10            1.584     37      9             1.599          1.559
11            1.591     33      8             1.486          1.407
12            1.624     40      7             1.817          1.922
13            1.634     35     11             1.427          1.297
14            1.782     43     13             1.687          1.724
15            1.894     33      3             1.749          1.967
16            1.962     45     12             1.839          1.988
17            1.967     45      4             2.235          2.687
18            2.124     39      6             1.826          1.947
19            2.270     45      5             2.172          2.563
20            2.337     46      6             2.186          2.580
RMSE                                           .183           .275
Note. Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′i and ci, are given by

d′i = μh − μf + αi − βi (48)

and

ci = −(μf + βi). (49)

Item-specific signal detection parameters, denoted d′j and cj, are given by

d′j = μh − μf + γj − δj (50)

and

cj = −(μf + δj). (51)

Priors on (μh, μf) are independent normals with mean μ0k and variance σ0k², where k indicates hits or false alarms. In the implementation, we set μ0k = 0 and σ0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

αi ~ Normal(0, σα²), (52)

βi ~ Normal(0, σβ²), (53)

γj ~ Normal(0, σγ²), (54)

and

δj ~ Normal(0, σδ²). (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8-11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in model analysis; participant and item effects on hits and false alarms are estimated independently.

Figure 18. Autocorrelation of overall sensitivity (μh − μf) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′i for 2 participants.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
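Thinning itself is simple indexing; for example (a sketch, with chain a hypothetical vector of post-burn-in samples):

# keep every 50th sample to attenuate autocorrelation
thinned = chain[seq(1, length(chain), by = 50)]
acf(thinned)  # inspect the remaining autocorrelation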

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μh is determined largely by the sum of Σiαi and Σjγj. Likewise, the sample of each αi is determined largely by the value of μh. On iteration m, the value of [μh]m−1 influences each value of [αi]m, and the sum of these [αi]m values has a large influence on [μh]m. By transitivity, the value of [μh]m−1 influences the value of [μh]m; that is, they are correlated. This type of correlation is also true of μf and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step rather than componentwise. For the model on old-item trials (hits), the parameters (μh, α, γ) are sampled together from a multivariate distribution. Let θh denote the vector of parameters, θh = (μh, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θh | σα², σγ², y) ∝ f(y | θh, σα², σγ²) × f(θh | σα², σγ²).

Each component of θh has a normal prior; hence, the prior f(θh | σα², σγ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σα², σγ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992); a sketch of this approach follows below. Conditional posteriors for new-item parameters (μf, β, δ, σβ², σδ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8-11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18). Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant: The y-axis value of the point is αi, the participant's hit rate effect; the x-axis value is βi, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γj vs. δj). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γj = δj). Overall sensitivity results of the simulation are shown in Figure 21.

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.

Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.

These are highly similar to the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.

Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(αi, βi)′ ~ Bivariate Normal(0, Σ(i))

and

(γj, δj)′ ~ Bivariate Normal(0, Σ(j)).

The covariance matrices Σ(i) and Σ(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
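As a small illustration (our sketch; the hyperparameter values are arbitrary and not from the paper), base R can draw a random covariance matrix by sampling a precision matrix from a Wishart and inverting it:

# one random 2x2 covariance matrix for (alpha_i, beta_i)
df = 3; S = diag(2)            # illustrative Wishart hyperparameters
P = rWishart(1, df, S)[, , 1]  # precision matrix ~ Wishart(df, S)
Sigma = solve(P)               # implied covariance matrix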

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is from the normal distribution: The posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on variance is 1/σ².

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis-Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective, because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times: The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices, because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when it happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools, such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking, and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. L., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let xj be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable wj is defined as follows:

xj = 1 if wj ≥ 0;  xj = 0 if wj < 0.  (A1)

Latent variables w are never known. All that is known is their sign: If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable wj is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p). Therefore,

wj ~ Normal(z, 1).

This construction of wj is shown in Figure A1a. The parameters of interest are z and w = (w1, …, wJ); the data are (x1, …, xJ). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and wj | z, xj, for j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ0 and σ0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ0²)⁻¹ (N w̄ + μ0/σ0²)  (A2)

and

σ′² = (N + 1/σ0²)⁻¹.  (A3)

The conditional wj | z, xj is only slightly more complex. If xj is unknown, latent variable wj is normally distributed and centered at z. When wj is conditioned on xj, the sign of wj is known: If xj = 0, then, by Equation A1, wj must be negative; hence, the conditional wj | z, xj is a normal density that is truncated above at 0 (Figure A1b). Likewise, if xj = 1, then wj must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.


NOTES

1. Be(a, b) = Γ(a)Γ(b) / Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0, and x_ij = 0 if w_ij < 0. (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.



APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-PLUS. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# -------------------------------------------
# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b, 1) truncated below at zero (positive samples)
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a normal(b, 1) truncated above at zero (negative samples)
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}
# -----------------------------
# generate true values and data
# -----------------------------
I = 20  # number of participants
J = 50  # number of trials per participant
# generate each individual's true p
p = rbeta(I, 3.5, 1.5)
# result, use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result, use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
# ----------------
# analysis setup
# ----------------
M = 10000        # total number of MCMC iterations
myrange = 101:M  # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z = matrix(ncol = I, nrow = M)  # matrix of each person's z at each iteration
# note z is probit of p, e.g., z = qnorm(p)
sigma2 = 1:M
mu = 1:M
# prior parameters
sigma0 = 100  # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1
# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)
# shape for inverse gamma, calculated outside of loop for speed
shape = I/2 + a
# --------------
# main MCMC loop
# --------------
for (m in 2:M) {    # iteration
  for (i in 1:I) {  # participant
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))  # samples of w | y, z
    prec = J + 1/sigma2[m-1]
    meanz = (sum(w) + mu[m-1]/sigma2[m-1]) / prec
    z[m, i] = rnorm(1, meanz, sqrt(1/prec))  # sample of z | w, mu, sigma2
  }
  prec = I/sigma2[m-1] + 1/sigma0
  meanmu = (sum(z[m, ])/sigma2[m-1] + mu0/sigma0) / prec
  mu[m] = rnorm(1, meanmu, sqrt(1/prec))  # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2)/2 + b
  sigma2[m] = 1/rgamma(1, shape = shape, scale = 1/rate)  # sample of sigma2 | mu, z
}
# ------------------------------------------
# gather estimates of p for each individual
# ------------------------------------------
allpest = pnorm(z[myrange, ])    # MCMC chains of p
pesth = apply(allpest, 2, mean)  # posterior means
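As a usage note (our addition, not part of the printed appendix), posterior quantities other than means are equally easy to obtain from the stored chains. For example, 95% credible intervals for each participant's probability may be computed directly from the samples in allpest:

# our illustrative addition: 95% credible intervals for each
# participant's p, computed from the post-burn-in MCMC samples
cred = apply(allpest, 2, quantile, probs = c(.025, .975))
round(t(cred), 3)  # one row per participant: lower and upper limits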


1,000 samples with 100 samples of burn-in are more than sufficient for this application.

ESTIMATING PROBABILITY FOR SEVERAL PARTICIPANTS

In this section, we propose a model for estimating several participants' probability of success. The main goal is to introduce hierarchical models within the context of a relatively simple example. The methods and insights discussed here are directly applicable to the hierarchical model of signal detection. Consider data from a number of participants, each with his or her own unique probability of success, p_i. In conventional estimation, each of these probabilities is estimated separately, usually by the estimator p̂_i = y_i/N_i, where y_i and N_i are the number of successes and trials, respectively, for the ith participant. We show here that hierarchical modeling leads to more accurate estimation.

In a hierarchical model, it is assumed that although individuals vary in ability, the distribution governing individuals' true abilities is orderly. We term this distribution the parent distribution. If we knew the parent distribution, this prior information would be useful in estimation. Consider the example in Figure 15. Suppose 10 individuals each performed 10 trials. Hypothetical performance is indicated with Xs. Let's suppose the task is relatively easy, so that individuals' true probabilities of success range from .7 to 1.0, as indicated by the curve in the figure. One participant does poorly, however, succeeding on only 3 trials. The conventional estimate for this poor-performing participant is .3, which is a poor estimate of the true probability (between .7 and 1.0). If the parent distribution is known, it would be evident that this poor score is influenced by a large degree of sample noise and should be corrected upward, as indicated by the arrow. The new estimate, which is more accurate, is denoted by an H (for hierarchical).

In the example above, we assumed a parent distribution. In actuality, such assumptions are unwarranted. In hierarchical models, parent distributions are estimated from the data; these parents, in turn, affect the estimates from outlying data. The estimated parent for the data in Figure 15 would have much of its mass in the upper ranges. When serving as a prior, the estimated parent would exert an upward tendency in the estimation of the outlying participant's probability, leading to better estimation. In Bayesian analysis, the method of implementing a hierarchical model is to use a hierarchical prior. The parent distribution serves as a prior for each individual, and this parent is termed the first stage of the prior. The parent distribution is not fully specified; instead, free parameters of the parent are estimated from the data. These parameters also need a prior on them, which is termed the second stage of the prior.

The probit transform model is ideally suited for a hierarchical prior. A normally distributed parent may be placed on transformed probabilities of success.

Figure 13. Three normally distributed priors on z and the corresponding priors on p. The priors on z are Normal(0, 1), Normal(1, 1), and Normal(0, 2). Priors on z were obtained by drawing 100,000 samples from the normal. Priors on p were obtained by transforming each sample of z.


The ith participant's true ability, z_i, is simply a sample from this normal. The number of successes is modeled as

y_i | z_i ~ Binomial[Φ(z_i), N_i], independently, (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ Normal(μ, σ²), independently. (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ_0, σ_0²) (34)

and

σ² ~ Inverse Gamma(a, b). (35)

Values of μ_0, σ_0², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities (see Note 2), which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p̂_0 (Equation 12) and p̂_2 (Equation 14) are shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ_0 = 0, σ_0² = 1000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (circles). Also included are classical estimates from p̂_0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart, p̂_0. The top right panel shows the hierarchical estimates as a function of p̂_0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p̂_2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p̂_0. For the data of Table 2, the difference in efficiency between p̂_2 and p̂_0 is exceedingly small (about .0001). The reason that there was little gain for p̂_2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and, for this value and these sample sizes, estimators p̂_2 and p̂_0 have about the same efficiency.


Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a″ = 8 and b″ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄. Of course, the advantage of the hierarchical model is a function of sample size: With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.
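The convergence and autocorrelation checks described above can be run directly on the chains produced by the Appendix C code. The following sketch is our addition; it uses base R's plot() and acf(), and the participant indices match the two bottom panels of Figure 16:

# sketch of the diagnostics described in the text, assuming the
# objects mu and allpest produced by the Appendix C sampler
plot(mu, type = "l")  # trace plot; a visual check for steady-state behavior
acf(allpest[, 5])     # autocorrelation of the p chain for participant 5
acf(allpest[, 15])    # autocorrelation of the p chain for participant 15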

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms: h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ Binomial[Φ(h_i), S_i], independently,

and

y_i^(f) | f_i ~ Binomial[Φ(f_i), N_i], independently.

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ Normal(μ_h, σ_h²), independently, (36)

and

f_i ~ Normal(μ_f, σ_f²), independently. (37)

Second-stage priors are placed on parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k), k = h, f, (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k), k = h, f. (39)

Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p̂_0   p̂_2   Hierarchical
 1           .587        33    .66   .654  .672
 2           .621        28    .56   .558  .615
 3           .633        34    .68   .673  .683
 4           .650        29    .58   .577  .627
 5           .659        32    .64   .635  .660
 6           .665        37    .74   .731  .716
 7           .667        30    .60   .596  .638
 8           .667        29    .58   .577  .627
 9           .687        34    .68   .673  .684
10           .693        34    .68   .673  .682
11           .696        34    .68   .673  .683
12           .701        37    .74   .731  .715
13           .713        38    .76   .750  .728
14           .734        38    .76   .750  .727
15           .736        37    .74   .731  .717
16           .738        33    .66   .654  .672
17           .743        37    .74   .731  .716
18           .759        33    .66   .654  .672
19           .786        38    .76   .750  .728
20           .803        42    .84   .827  .771
RMSE                           .053  .053  .041
Note—Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished: For all samples m,

[d′_i]_m = [h_i]_m − [f_i]_m (40)

and

[c_i]_m = −[f_i]_m. (41)
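In code, Equations 40 and 41 are elementwise operations on the stored chains. The sketch below is our illustration; hchain and fchain are hypothetical matrices of posterior samples of h_i and f_i (rows index MCMC iterations, columns index participants) from two runs of the Appendix C sampler:

# Equations 40 and 41 applied to every posterior sample
# (hchain and fchain are hypothetical names: iterations by participants)
dprime.chain = hchain - fchain             # Equation 40
c.chain = -fchain                          # Equation 41
dprime.est = apply(dprime.chain, 2, mean)  # posterior mean of d' per participant
c.est = apply(c.chain, 2, mean)            # posterior mean of c per participant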

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = 0.1, u_h = u_f = 0, and v_h = v_f = 1000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information; the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p̂_0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p̂_0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses in which scores are aggregated over items result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was old or new. Let p_ij^(h) and p_ij^(f) be the true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination, respectively. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) | h_ij ~ Bernoulli[Φ(h_ij)], independently, (42)

and

y_ij^(f) | f_ij ~ Bernoulli[Φ(f_ij)], independently. (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j (44)

and

f_ij = μ_f + β_i + δ_j. (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f (46)

and

c = −μ_f. (47)

Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146         0.880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580
RMSE                                       .183          .275
Note—Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i (48)

and

c_i = −(μ_f + β_i). (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j (50)

and

c_j = −(μ_f + δ_j). (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ Normal(0, σ_α²), independently, (52)

β_i ~ Normal(0, σ_β²), independently, (53)

γ_j ~ Normal(0, σ_γ²), independently, (54)

and

δ_j ~ Normal(0, σ_δ²), independently. (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = 0.1, b = 0.1). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities.


Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.


Although the data were generated with a mirror effect, this effect is not assumed in model analysis. Participant and item effects on hits and false alarms are estimated independently.
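For concreteness, the following sketch (our addition) generates data of this general form. The grand means, effect standard deviations, and the device of setting β_i = −α_i and δ_j = −γ_j to produce a mirror effect are illustrative assumptions, not the article's exact generating equations (Equations 8–11):

# illustrative mirror-effect data generation (our parameter choices)
I = 20; J = 50                   # participants and items
muh = 1; muf = -1                # grand means on the probit scale
alpha = rnorm(I, 0, .3)          # participant effects on hits
gamma = rnorm(J, 0, .3)          # item effects on hits
beta = -alpha; delta = -gamma    # mirror effect: higher hits, lower false alarms
h = muh + outer(alpha, gamma, "+")  # h_ij = mu_h + alpha_i + gamma_j
f = muf + outer(beta, delta, "+")   # f_ij = mu_f + beta_i + delta_j
yh = matrix(rbinom(I * J, 1, pnorm(h)), I, J)  # old-item responses (hits)
yf = matrix(rbinom(I * J, 1, pnorm(f)), I, J)  # new-item responses (false alarms)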

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number (for example, discarding all but every 50th observation). The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation: Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
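In code, thinning is a one-line operation on a stored chain. The sketch below (our addition) assumes a chain vector mu of length M, a burn-in of 100, and the multiple of 50 used in the example above:

# sketch of thinning: discard burn-in, keep every 50th sample
keep = seq(from = 101, to = M, by = 50)
mu.thinned = mu[keep]
acf(mu.thinned)  # autocorrelation should now be greatly attenuated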

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sum of Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on μ_h. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation is also true of μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
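The Cholesky device mentioned above is straightforward to code. The helper below is our sketch (rmvn is our name, not a function from the article); it draws one sample from a multivariate normal with mean vector m and covariance matrix S using only univariate normals:

# sample a multivariate normal via the Cholesky decomposition (Ashby, 1992):
# R's chol() returns the upper-triangular factor U with S = U'U;
# if e ~ N(0, I), then m + U'e has covariance U'U = S
rmvn = function(m, S) {
  U = chol(S)
  as.vector(m + t(U) %*% rnorm(length(m)))
}
rmvn(c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))  # example: one correlated draw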

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18). Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.


The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant: The y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected; participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects reflected the mirror effect, as before.


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21. These are highly similar to the previous simulation, in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect.


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ Bivariate Normal(0, Σ^(i)), independently,

and

(γ_j, δ_j)′ ~ Bivariate Normal(0, Σ^(j)), independently.

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
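As a sketch of how such a prior behaves (our addition; the degrees of freedom and scale matrix are illustrative choices), one can draw a covariance matrix for (α_i, β_i) from an inverse-Wishart distribution, via base R's rWishart(), and then generate correlated participant effects from it:

# draw Sigma from an inverse-Wishart prior (illustrative df and scale),
# then draw 20 participants' (alpha_i, beta_i) pairs with that covariance
df = 3; S0 = diag(2)
W = rWishart(1, df, solve(S0))[, , 1]  # Wishart draw
Sigma = solve(W)                       # inverse-Wishart draw
U = chol(Sigma)
effects = t(t(U) %*% matrix(rnorm(2 * 20), 2, 20))  # 20 x 2 matrix of effects
cor(effects)[1, 2]  # induced correlation between hit and false alarm effects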

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold: First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0.



A second example is from the normal distribution: The posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on variance is 1/σ².
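The binomial case can be written out in one line (this display is our addition). With y successes in N trials and a Beta(a, b) prior, the posterior is

p | y ~ Beta(y + a, N − y + b), with E(p | y) = (y + a)/(N + a + b),

which reduces to the sample proportion y/N when a = b = 0.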

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones; and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information. This is information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens J H amp Dieter U (1974) Computer methods for samplingfrom gamma beta Poisson and binomial distributions Computing12 223-246

Ahrens J H amp Dieter U (1982) Generating gamma variates by amodified rejection technique Communications of the Association forComputing Machinery 25 47-54

Albert J amp Chib S (1995) Bayesian residual analysis for binary re-sponse regression models Biometrika 82 747-759

Ashby F G (1992) Multivariate probability distributions In F GAshby (Ed) Multidimensional models of perception and cognition(pp 1-34) Hillsdale NJ Erlbaum

Ashby F G Maddox W T amp Lee W W (1994) On the dangers ofaveraging across subjects when using multidimensional scaling or thesimilarity-choice model Psychological Science 5 144-151

Baayen R H Tweedie F J amp Schreuder R (2002) The subjectsas a simple random effect fallacy Subject variability and morpho-logical family effects in the mental lexicon Brain amp Language 8155-65

Bayes T (1763) An essay towards solving a problem in the doctrineof chances Philosophical Transactions of the Royal Society of Lon-don 53 370-418

Clark H H (1973) The language-as-fixed-effect fallacy A critiqueof language statistics in psychological research Journal of VerbalLearning amp Verbal Behavior 12 335-359

Cohen J (1994) The earth is round ( p 05) American Psychologist49 997-1003

Cumming G amp Finch S (2001) A primer on the understanding useand calculation of confidence intervals that are based on central andnoncentral distributions Educational amp Psychological Measurement61 532-574

Curran T C amp Hintzman D L (1995) Violations of the indepen-dence assumption in process dissociation Journal of ExperimentalPsychology Learning Memory amp Cognition 21 531-547

Egan J P (1975) Signal detection theory and ROC analysis NewYork Academic Press

Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.

Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.

Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.

Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0,  x_j = 0 if w_j < 0.  (A1)

Latent variables w are never known; all that is known is their sign: If trial j was successful, w_j was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (see Note A1). Therefore,

w_j ~ind Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N w̄ + μ_0/σ_0²) / (N + 1/σ_0²)  (A2)

and

σ′² = (N + 1/σ_0²)⁻¹.  (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.
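For concreteness, here is a minimal R sketch of this two-step Gibbs sampler (a sketch, not the article's posted code; it assumes a 0/1 outcome vector x and the truncated-normal samplers rtnpos and rtnneg defined in Appendix C):

# Gibbs sampler for one probit-transformed probability (sketch)
M <- 10000; mu0 <- 0; sigma0 <- 100   # iterations and prior variance (illustrative)
J <- length(x); s <- sum(x)           # trials and successes
z <- numeric(M)                       # chain; starts at 0
for (m in 2:M) {
  # step 1: w | z, x -- truncated normals, positive for successes
  w <- c(rtnpos(s, z[m-1]), rtnneg(J - s, z[m-1]))
  # step 2: z | w -- normal with mean and variance from Equations A2 and A3
  v <- 1 / (J + 1/sigma0)
  z[m] <- rnorm(1, v * (sum(w) + mu0/sigma0), sqrt(v))
}
p.post <- pnorm(z[-(1:100)])          # posterior samples of p after burn-in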

Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.

Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.

Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.

NOTES

1. Be(a, b) = Γ(a)Γ(b) / Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0,  x_ij = 0 if w_ij < 0.  (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.

APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with a variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# -------------------------------------------
# Functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a Normal(b,1) truncated below zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a Normal(b,1) truncated above zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# Generate true values and data
# -----------------------------
I = 20   # number of participants
J = 50   # number of trials per participant
# generate each individual's true p
p = rbeta(I, 3.5, 1.5)
# result; use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result; use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# Analysis setup
# ---------------
M = 10000         # total number of MCMC iterations
myrange = 101:M   # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z = matrix(ncol = I, nrow = M)   # each person's z at each iteration;
                                 # note z is the probit of p, i.e., z = qnorm(p)
sigma2 = 1:M
mu = 1:M
# prior parameters
sigma0 = 100   # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1
# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)
# shape for inverse gamma, calculated outside of loop for speed
shape = I / 2 + a

# --------------
# Main MCMC loop
# --------------
for (m in 2:M) {       # iteration
  for (i in 1:I) {     # participant
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))   # samples of w | y, z
    prec = J + 1 / sigma2[m-1]
    meanz = (sum(w) + mu[m-1] / sigma2[m-1]) / prec
    z[m, i] = rnorm(1, meanz, sqrt(1 / prec))   # sample of z | w, mu, sigma2
  }
  prec = I / sigma2[m-1] + 1 / sigma0
  meanmu = (sum(z[m, ]) / sigma2[m-1] + mu0 / sigma0) / prec
  mu[m] = rnorm(1, meanmu, sqrt(1 / prec))   # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2) / 2 + b
  sigma2[m] = 1 / rgamma(1, shape = shape, scale = 1 / rate)   # sample of sigma2 | z
}

# ------------------------------------------
# Gather estimates of p for each individual
# ------------------------------------------
allpest = pnorm(z[myrange, ])     # MCMC chains of p
pesth = apply(allpest, 2, mean)   # posterior means
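Once the script has run, the fit can be checked against the generating values; a minimal usage sketch (reproducing the top left panel of Figure 16):

plot(p, pesth)   # posterior means against true probabilities
abline(0, 1)     # identity line; points near it indicate accurate recovery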

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



participant's true ability z_i is simply a sample from this normal. The number of successes is modeled as

y_i | z_i ~ind Binomial[N_i, Φ(z_i)],  (32)

with a parent distribution (first-stage prior) given by

z_i | μ, σ² ~ind Normal(μ, σ²).  (33)

In this case, parameters μ and σ² describe the location and variance of the parent distribution and reflect group-level properties. Priors are needed on μ and σ² (second-stage priors). Suitable choices are

μ ~ Normal(μ_0, σ_0²)  (34)

and

σ² ~ Inverse Gamma(a, b).  (35)

Values of μ_0, σ_0², a, and b must be specified before analysis.

We performed an analysis on hypothetical data in order to demonstrate the hierarchical model. Table 2 shows hypothetical data for a set of 20 participants performing 50 trials. These success frequencies were generated from a binomial, but each individual had a unique true probability of success. These true probabilities (see Note 2), which ranged between .59 and .80, are shown in the second column of Table 2. Parameter estimators p0 (Equation 12) and p2 (Equation 14) are shown.

All of the theoretical results needed to derive the conditional posterior distributions have already been presented. The Albert and Chib (1995) analysis for this application is provided in Appendix B. A coded version of MCMC for this model is presented in Appendix C.

The MCMC chain was run for 10,000 iterations. Priors were set to be fairly diffuse (μ_0 = 0, σ_0² = 1,000, a = b = 1). All parameters converged to steady-state behavior quickly (burn-in was conservative at 100 iterations). Autocorrelation of samples of p_i for 2 select participants is shown in the bottom panels of Figure 16. The autocorrelation in these plots was typical of other parameters. Autocorrelation for this model was no worse than that for the nonhierarchical Bayesian model of the preceding section. The top left panel of Figure 16 shows posterior means of the probability of success as a function of the true probability of success for the hierarchical model (filled circles). Also included are classical estimates from p0 (triangles). Points from the hierarchical model are, on average, closer to the diagonal. As is indicated in Table 2, the hierarchical model has lower root mean squared error than does its frequentist counterpart p0. The top right panel shows the hierarchical estimates as a function of p0. The effect of the hierarchical structure is clear: The extreme estimates in the frequentist approach are moderated in the Bayesian analysis.

Estimator p2 also moderates extreme values. The tendency is to pull them toward the middle; specifically, estimates are closer to .5 than they are with p0. For the data of Table 2, the difference in efficiency between p2 and p0 is exceedingly small (about .0001). The reason that there was little gain for p2 is that it pulled estimates toward a value of .5, which is not the mean of the true probabilities. The mean was p̄ = .69, and for this value and these sample sizes, estimators p2 and p0 have about the same efficiency.
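For readers following along in R, autocorrelation functions like those in the bottom panels of Figure 16 can be computed with the built-in acf function; a minimal sketch using the chain matrix from Appendix C:

acf(allpest[, 5])    # autocorrelation of participant 5's chain of p
acf(allpest[, 15])   # and participant 15's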

Figure 14. (Top) Histograms of the post-burn-in samples of z and p, respectively, for 7 successes out of 10 trials. The line fit for p is the true posterior, a beta distribution with a″ = 8 and b″ = 4. (Bottom) Autocorrelation function for z. There is a small degree of autocorrelation.

Figure 15. Example of how a parent distribution affects estimation. The outlying observation is incompatible with the parent distribution. Bayesian estimation allows for the influence of the parent as a prior. The resulting estimate is shifted upward, reducing error.


The hierarchical estimator is superior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value p̄. Of course, the advantage of the hierarchical model is a function of sample size. With increasing sample sizes, the amount of pulling decreases. In the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ⁻¹(p_i^(h)) − Φ⁻¹(p_i^(f))

and

c_i = −Φ⁻¹(p_i^(f)),

where p_i^(h) and p_i^(f) are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms h_i = Φ⁻¹(p_i^(h)) and f_i = Φ⁻¹(p_i^(f)). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f, rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^(h) and y_i^(f) denote the number of hits and false alarms, respectively. The model for y_i^(h) and y_i^(f) is given by

y_i^(h) | h_i ~ind Binomial[S_i, Φ(h_i)]

and

y_i^(f) | f_i ~ind Binomial[N_i, Φ(f_i)].

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ind Normal(μ_h, σ_h²)  (36)

and

f_i ~ind Normal(μ_f, σ_f²).  (37)

Second-stage priors are placed on the parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k), k = h, f,  (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k), k = h, f.  (39)

Table 2
Data and Estimates of a Probability of Success

Participant  True Prob.  Data  p0    p2    Hierarchical
 1           .587        33    .66   .654  .672
 2           .621        28    .56   .558  .615
 3           .633        34    .68   .673  .683
 4           .650        29    .58   .577  .627
 5           .659        32    .64   .635  .660
 6           .665        37    .74   .731  .716
 7           .667        30    .60   .596  .638
 8           .667        29    .58   .577  .627
 9           .687        34    .68   .673  .684
10           .693        34    .68   .673  .682
11           .696        34    .68   .673  .683
12           .701        37    .74   .731  .715
13           .713        38    .76   .750  .728
14           .734        38    .76   .750  .727
15           .736        37    .74   .731  .717
16           .738        33    .66   .654  .672
17           .743        37    .74   .731  .716
18           .759        33    .66   .654  .672
19           .786        38    .76   .750  .728
20           .803        42    .84   .827  .771

RMSE                           .053  .053  .041

Note: Data are numbers of successes in 50 trials.
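The RMSE row of Table 2 can be recomputed from the Appendix C objects; a minimal sketch (p, y, J, and pesth as defined there):

p0 = y / J                                    # conventional estimates
rmse = function(est) sqrt(mean((est - p)^2))  # root mean squared error
rmse(p0)     # approximately .053
rmse(pesth)  # approximately .041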


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished: For all samples,

[d′_i]_m = [h_i]_m − [f_i]_m  (40)

and

[c_i]_m = −[f_i]_m.  (41)

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ⁻¹(hit rate) − Φ⁻¹(false alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = .01, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17.
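A minimal R sketch of this second step (h.chain and f.chain are assumed vectors of posterior samples of h_i and f_i for one participant, from the two separate runs; yh, yf, S, and N are that participant's hit and false alarm counts and trial numbers):

dprime.chain = h.chain - f.chain   # Equation 40, applied samplewise
c.chain = -f.chain                 # Equation 41
mean(dprime.chain)                 # posterior mean of d'
# conventional estimate, for comparison:
qnorm(yh / S) - qnorm(yf / N)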


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses in which scores are aggregated over items result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was old or new. Let p_ij^(h) and p_ij^(f) be true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination, respectively. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) | h_ij ~ind Bernoulli[Φ(h_ij)]  (42)

and

y_ij^(f) | f_ij ~ind Bernoulli[Φ(f_ij)].  (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j  (44)

and

f_ij = μ_f + β_i + δ_j.  (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f  (46)

and

c = −μ_f.  (47)

Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1           1.238    35     7            1.608         1.605
 2           1.239    38    17            1.307         1.119
 3           1.325    38    11            1.554         1.478
 4           1.330    33    16            1.146         0.880
 5           1.342    40    19            1.324         1.147
 6           1.405    39     8            1.730         1.767
 7           1.424    37    12            1.466         1.350
 8           1.460    27     5            1.417         1.382
 9           1.559    35     9            1.507         1.440
10           1.584    37     9            1.599         1.559
11           1.591    33     8            1.486         1.407
12           1.624    40     7            1.817         1.922
13           1.634    35    11            1.427         1.297
14           1.782    43    13            1.687         1.724
15           1.894    33     3            1.749         1.967
16           1.962    45    12            1.839         1.988
17           1.967    45     4            2.235         2.687
18           2.124    39     6            1.826         1.947
19           2.270    45     5            2.172         2.563
20           2.337    46     6            2.186         2.580

RMSE                                       .183          .275

Note: Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i  (48)

and

c_i = −(μ_f + β_i).  (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j  (50)

and

c_j = −(μ_f + δ_j).  (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indicates hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²),  (52)

β_i ~ind Normal(0, σ_β²),  (53)

γ_j ~ind Normal(0, σ_γ²),  (54)

and

δ_j ~ind Normal(0, σ_δ²).  (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8-11): Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities.
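A minimal R sketch of a generating process of this sort (illustrative only; the article's exact generating values for Equations 8-11 are not reproduced here):

I = 20; J = 50                    # participants and items
mu.h = 1.0; mu.f = -0.5           # illustrative grand means
sens.p = rnorm(I, 0, 0.3)         # participant sensitivity effects
sens.i = rnorm(J, 0, 0.3)         # item sensitivity effects
# mirror effect: higher sensitivity raises hits and lowers false alarms
h = mu.h + outer(sens.p, sens.i, "+") / 2
f = mu.f - outer(sens.p, sens.i, "+") / 2
yh = matrix(rbinom(I * J, 1, pnorm(h)), I, J)   # old-item responses
yf = matrix(rbinom(I * J, 1, pnorm(f)), I, J)   # new-item responses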

Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.


Although the data were generated with a mirror effect, this effect is not assumed in model analysis: Participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number (for example, discarding all but every 50th observation). The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation: Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
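In R, thinning a stored chain is a one-line operation; a minimal sketch (dprime.chain is an assumed vector of posterior samples):

thinned = dprime.chain[seq(1, length(dprime.chain), by = 50)]   # keep every 50th sample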

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sum of Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the values of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on μ_h. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation is also true of μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | y, σ_α², σ_γ²) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²); these remain the same as before: They are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992); a sketch is given below. Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
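A minimal sketch of the Cholesky approach (a generic helper written for illustration, not the article's posted code; m is a mean vector and Sigma a covariance matrix):

# one draw from Normal(m, Sigma) built from univariate standard normals
rmvnorm1 = function(m, Sigma) {
  R = chol(Sigma)                  # upper triangular, with t(R) %*% R = Sigma
  as.vector(m + t(R) %*% rnorm(length(m)))
}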

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8-11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18): Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of 27% in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant: The y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation.


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21. These are highly similar to the previous simulation, in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect.


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ind Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ind Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
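Base R can draw from a Wishart distribution with the rWishart function; a minimal sketch (the degrees of freedom and scale matrix below are illustrative, not values from the article):

S = diag(2)                               # illustrative 2 x 2 scale matrix
W = rWishart(1, df = 3, Sigma = S)[,,1]   # one Wishart draw (e.g., a precision matrix)
Sigma.draw = solve(W)                     # corresponding covariance matrix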

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is from the normal distribution: The posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on variance is 1/σ².
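To make the first example concrete (this is standard beta-binomial conjugacy, supplied here as a reminder rather than taken from the article): with a Beta(a, b) prior and y successes in N trials, the posterior is Beta(a + y, b + N − y), whose mean, (a + y)/(a + b + N), approaches the sample proportion y/N as a and b approach 0.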

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis-Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: This is information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research; on the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants: The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times; the impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent; the value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.

Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.

Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.

Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.

Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.

Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.

Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.

Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.

Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.

Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.

Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.

Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.


Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.

Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.

Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.

Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.

Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.

Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.

Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.

Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.

Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.

Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.

Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.

Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.

Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.

Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.

Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.

Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.

Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.

Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.

Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.

Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.

McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.

Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.

Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.

Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.

Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.

Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.

Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.

Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.

Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.

Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.

Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.

Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.

Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0, and x_j = 0 if w_j < 0.  (A1)

Latent variables w are never known; all that is known is their sign. If trial j was successful, w_j was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ^{-1}(p) (see the note to this appendix). Therefore,

w_j ~ind Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are x = (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ_0²)^{-1} (N w̄ + μ_0/σ_0²)  (A2)

and

σ′² = (N + 1/σ_0²)^{-1}.  (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative; hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below depending on whether or not the trial was answered successfully.

Once the conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.
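To make the sampling scheme concrete, here is a minimal sketch in R of this Gibbs sampler for a single participant (the data values, prior settings, and chain length are illustrative assumptions; rtnpos and rtnneg are the truncated normal samplers from Appendix C):

rtnpos = function(n, b) { u = runif(n); b + qnorm(1 - u * pnorm(b)) }  # normal(b,1) truncated below at 0
rtnneg = function(n, b) { u = runif(n); b + qnorm(u * pnorm(-b)) }     # normal(b,1) truncated above at 0
J = 50; y = 34               # hypothetical data: y successes in J trials
mu0 = 0; sigma0 = 100        # prior mean and variance for z
M = 2000                     # chain length (assumed)
z = numeric(M)
for (m in 2:M) {
  # w | z, x: positive draws for successes, negative draws for failures (Equation A1)
  w = c(rtnpos(y, z[m-1]), rtnneg(J - y, z[m-1]))
  # z | w: normal with parameters given by Equations A2 and A3
  prec = J + 1/sigma0
  z[m] = rnorm(1, (sum(w) + mu0/sigma0)/prec, sqrt(1/prec))
}
p.est = mean(pnorm(z[-(1:100)]))  # posterior mean of p after discarding burn-in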


NOTES

1. Be(a, b) = Γ(a)Γ(b) / Γ(a + b), where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0, and x_ij = 0 if w_ij < 0.  (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with a variance of 1.0, so the term Pr(w_j < 0) is its cumulative distribution function evaluated at 0:

Pr(w_j < 0) = Φ(0 − μ_w) = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ^{-1}(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# -------------------------------------------
# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b, 1) truncated below zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a normal(b, 1) truncated above zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# generate true values and data
# -----------------------------
I = 20   # number of participants
J = 50   # number of trials per participant

# generate each individual's true p
p = rbeta(I, 3.5, 1.5)
# result, use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251

# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result, use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# analysis setup
# ---------------
M = 10000          # total number of MCMC iterations
myrange = 101:M    # portion kept; burn-in of 100 iterations discarded

# needed vectors and matrices
z = matrix(ncol = I, nrow = M)   # each person's z at each iteration;
                                 # note z is the probit of p, i.e., z = qnorm(p)
sigma2 = 1:M
mu = 1:M

# prior parameters
sigma0 = 100   # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1

# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)

# shape for the inverse gamma, calculated outside of the loop for speed
shape = I/2 + a


# --------------
# main MCMC loop
# --------------
for (m in 2:M) {                  # iteration
  for (i in 1:I) {                # participant
    # samples of w | y, z
    w = c(rtnpos(y[i], z[m-1, i]), rtnneg(J - y[i], z[m-1, i]))
    prec = J + 1/sigma2[m-1]
    meanz = (sum(w) + mu[m-1]/sigma2[m-1]) / prec
    z[m, i] = rnorm(1, meanz, sqrt(1/prec))          # sample of z | w, mu, sigma2
  }
  prec = I/sigma2[m-1] + 1/sigma0
  meanmu = (sum(z[m, ])/sigma2[m-1] + mu0/sigma0) / prec
  mu[m] = rnorm(1, meanmu, sqrt(1/prec))             # sample of mu | z, sigma2
  rate = sum((z[m, ] - mu[m])^2)/2 + b
  sigma2[m] = 1/rgamma(1, shape = shape, scale = 1/rate)   # sample of sigma2 | z, mu
}

# -----------------------------------------
# gather estimates of p for each individual
# -----------------------------------------
allpest = pnorm(z[myrange, ])      # MCMC chains of p
pesth = apply(allpest, 2, mean)    # posterior means
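As a usage sketch (assuming the objects p, y, J, and pesth from the code above), the hierarchical posterior means can be compared with the true probabilities and the sample proportions, mirroring the RMSE row of Table 2:

cbind(true = p, sample = y/J, hierarchical = pesth)  # side-by-side comparison
sqrt(mean((pesth - p)^2))                            # RMSE of the posterior means
sqrt(mean((y/J - p)^2))                              # RMSE of the sample proportions, for reference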

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



rior to these nonhierarchical estimators because it pulls outlying estimates toward the mean value. Of course, the advantage of the hierarchical model is a function of sample size: With increasing sample sizes, the amount of pulling decreases, and in the limit, all three estimators converge to the true individual's probability.

A HIERARCHICAL SIGNAL DETECTION MODEL

Our ultimate goal is a signal detection model that accounts for both participant and item variability. As a precursor, we present a hierarchical signal detection model without item variability; the model will be expanded in the next section to include item variability. The model here is applicable to psychophysical experiments in which signal and noise trials are identical replicates, respectively.

Signal detection parameters are simple subtractions of probit-transformed probabilities. Let d′_i and c_i denote signal detection parameters for the ith participant:

d′_i = Φ^{-1}(p_i^{(h)}) − Φ^{-1}(p_i^{(f)})

and

c_i = −Φ^{-1}(p_i^{(f)}),

where p_i^{(h)} and p_i^{(f)} are individuals' true hit and false alarm probabilities, respectively. Let h_i and f_i be the probit transforms h_i = Φ^{-1}(p_i^{(h)}) and f_i = Φ^{-1}(p_i^{(f)}). Then sensitivity and criteria are given by

d′_i = h_i − f_i

and

c_i = −f_i.

The tactic we follow is to place hierarchical models on h and f rather than on signal detection parameters directly. Because the relationship between signal detection parameters and probit-transformed probabilities is linear, accurate estimation of h_i and f_i results in accurate estimation of d′_i and c_i.

The model is developed as follows. Let N_i and S_i denote the number of noise and signal trials given to the ith participant. Let y_i^{(h)} and y_i^{(f)} denote the number of hits and false alarms, respectively. The model for y_i^{(h)} and y_i^{(f)} is given by

y_i^{(h)} | h_i ~ind Binomial(S_i, Φ(h_i))

and

y_i^{(f)} | f_i ~ind Binomial(N_i, Φ(f_i)).

It is assumed that each participant's parameters are drawn from parent distributions. These parent distributions serve as the first stage of the hierarchical prior and are given by

h_i ~ind Normal(μ_h, σ_h²)  (36)

and

f_i ~ind Normal(μ_f, σ_f²).  (37)

Second-stage priors are placed on the parent distribution parameters (μ_k, σ_k²) as

μ_k ~ Normal(u_k, v_k), k = h, f,  (38)

and

σ_k² ~ Inverse Gamma(a_k, b_k), k = h, f.  (39)

Table 2
Data and Estimates of a Probability of Success

Participant   True Prob.   Data   p0    p2     Hierarchical
 1            .587         33     .66   .654   .672
 2            .621         28     .56   .558   .615
 3            .633         34     .68   .673   .683
 4            .650         29     .58   .577   .627
 5            .659         32     .64   .635   .660
 6            .665         37     .74   .731   .716
 7            .667         30     .60   .596   .638
 8            .667         29     .58   .577   .627
 9            .687         34     .68   .673   .684
10            .693         34     .68   .673   .682
11            .696         34     .68   .673   .683
12            .701         37     .74   .731   .715
13            .713         38     .76   .750   .728
14            .734         38     .76   .750   .727
15            .736         37     .74   .731   .717
16            .738         33     .66   .654   .672
17            .743         37     .74   .731   .716
18            .759         33     .66   .654   .672
19            .786         38     .76   .750   .728
20            .803         42     .84   .827   .771
RMSE                              .053  .053   .041

Note: Data are numbers of successes in 50 trials.


Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section: Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished: For all samples m,

[d′_i]_m = [h_i]_m − [f_i]_m  (40)

and

[c_i]_m = −[f_i]_m.  (41)
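In R, this second step is a few lines. A sketch, assuming hchain and fchain are hypothetical matrices of posterior samples of h_i and f_i (iterations in rows, participants in columns) produced by two runs of the Appendix C sampler:

dchain = hchain - fchain         # Equation 40, applied to every posterior sample
cchain = -fchain                 # Equation 41
d.est = apply(dchain, 2, mean)   # posterior mean sensitivity for each participant
c.est = apply(cchain, 2, mean)   # posterior mean criterion for each participant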

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a log-normal and from a normal distribution, respectively. True d′ values are shown; they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

d̂′_i = Φ^{-1}(hit rate) − Φ^{-1}(false alarm rate).
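A sketch of this data generation and the conventional estimator follows; the log-normal and normal parameters are illustrative assumptions chosen to give d′ values in roughly the range of Table 3:

I = 20; S = 50; N = 50                            # participants, signal trials, noise trials
d.true = rlnorm(I, meanlog = 0.45, sdlog = 0.2)   # true d' (assumed parameters)
c.true = rnorm(I, mean = 0, sd = 0.2)             # true criteria (assumed parameters)
hits = rbinom(I, S, pnorm(d.true - c.true))       # hit frequencies; h = d' - c
fas  = rbinom(I, N, pnorm(-c.true))               # false alarm frequencies; f = -c
d.conv = qnorm(hits/S) - qnorm(fas/N)             # conventional estimate per participant
# note: perfect scores (50 or 0) would yield infinite conventional estimates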

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters: a_h = a_f = b_h = b_f = .01, u_h = u_f = 0, and v_h = v_f = 1,000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in; this choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: The hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17.

Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p0, and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency; conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses in which scores are aggregated over items result in inconsistent estimates characterized by asymptotic bias. We show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we proceed with the straightforward but problematic extension; then we introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^{(h)} and y_ij^{(f)} denote the response of the ith person to the jth item. Values of y_ij^{(h)} and y_ij^{(f)} are dichotomous, indicating whether the response was "old" or "new." Let p_ij^{(h)} and p_ij^{(f)} be the true probabilities that y_ij^{(h)} = 1 and y_ij^{(f)} = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ^{-1}(p_ij^{(h)}), f_ij = Φ^{-1}(p_ij^{(f)})]. The model is given by

y_ij^{(h)} | h_ij ~ind Bernoulli(Φ(h_ij))  (42)

and

y_ij^{(f)} | f_ij ~ind Bernoulli(Φ(f_ij)).  (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j  (44)

and

f_ij = μ_f + β_i + δ_j.  (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f  (46)

and

c = −μ_f.  (47)

Table 3
Data and Estimates of Sensitivity

Participant   True d′   Hits   False Alarms   Hierarchical   Conventional
 1            1.238     35      7             1.608          1.605
 2            1.239     38     17             1.307          1.119
 3            1.325     38     11             1.554          1.478
 4            1.330     33     16             1.146          0.880
 5            1.342     40     19             1.324          1.147
 6            1.405     39      8             1.730          1.767
 7            1.424     37     12             1.466          1.350
 8            1.460     27      5             1.417          1.382
 9            1.559     35      9             1.507          1.440
10            1.584     37      9             1.599          1.559
11            1.591     33      8             1.486          1.407
12            1.624     40      7             1.817          1.922
13            1.634     35     11             1.427          1.297
14            1.782     43     13             1.687          1.724
15            1.894     33      3             1.749          1.967
16            1.962     45     12             1.839          1.988
17            1.967     45      4             2.235          2.687
18            2.124     39      6             1.826          1.947
19            2.270     45      5             2.172          2.563
20            2.337     46      6             2.186          2.580
RMSE                                          .183           .275

Note: Data are numbers of signal responses in 50 trials.


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = μ_h − μ_f + α_i − β_i  (48)

and

c_i = −(μ_f + β_i).  (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = μ_h − μ_f + γ_j − δ_j  (50)

and

c_j = −(μ_f + δ_j).  (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indexes hits or false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²),  (52)

β_i ~ind Normal(0, σ_β²),  (53)

γ_j ~ind Normal(0, σ_γ²),  (54)

and

δ_j ~ind Normal(0, σ_δ²).  (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to a = b = .01. The model was evaluated with the same simulation from the introduction (Equations 8-11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: Item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities.
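A generative sketch of Equations 42-45 under the mirror effect principle (the grand means and effect variances here are illustrative assumptions):

I = 20; J = 50                             # participants and items
mu.h = 1; mu.f = -1                        # grand means (assumed)
alpha = rnorm(I, 0, 0.3)                   # participant effects on hits
gam   = rnorm(J, 0, 0.3)                   # item effects on hits
beta  = -alpha                             # mirror effect: higher hits pair with lower false alarms
delta = -gam
h = mu.h + outer(alpha, gam, "+")          # h_ij = mu_h + alpha_i + gamma_j (Equation 44)
f = mu.f + outer(beta, delta, "+")         # f_ij = mu_f + beta_i + delta_j (Equation 45)
yh = matrix(rbinom(I*J, 1, pnorm(h)), I, J)   # old-item responses (Equation 42)
yf = matrix(rbinom(I*J, 1, pnorm(f)), I, J)   # new-item responses (Equation 43)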

Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.

Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.


Although the data were generated with a mirror effect, this effect is not assumed in model analysis; participant and item effects on hits and false alarms are estimated independently.

Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis: If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day; with exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation: Chains with larger amounts of autocorrelation implicate larger multiples for thinning.

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise: For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m-1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m-1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior. Hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
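A sketch of the Cholesky approach for drawing a single multivariate normal vector with mean m and covariance V (the conditional posterior mean and covariance themselves are assumed to have been computed already):

rmvn = function(m, V) {
  R = chol(V)                               # upper-triangular factor; V = t(R) %*% R
  m + as.vector(t(R) %*% rnorm(length(m)))  # t(R) %*% z has covariance V for z ~ N(0, I)
}
# example draw for theta_h = (mu_h, alpha, gamma), given hypothetical
# conditional posterior mean mpost and covariance Vpost:
# theta = rmvn(mpost, Vpost)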

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8-11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of 27% in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias; they are also 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant: The y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected; participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation.

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.

Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21. These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect.

Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.

Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that this comes about because the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence: We seek a prior under which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ind Bivariate Normal(0, Σ^{(i)})

and

(γ_j, δ_j)′ ~ind Bivariate Normal(0, Σ^{(j)}).

The covariance matrices Σ^{(i)} and Σ^{(j)} allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
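A sketch of drawing from such a prior in R with the base rWishart function (the degrees of freedom and scale matrix here are assumed hyperparameters):

W = rWishart(1, df = 3, Sigma = diag(2))[, , 1]  # Wishart draw
Sigma.i = solve(W)                               # inverse-Wishart draw for the covariance of (alpha_i, beta_i)
L = t(chol(Sigma.i))                             # lower Cholesky factor
effects = t(L %*% matrix(rnorm(2 * 20), 2, 20))  # 20 correlated (alpha_i, beta_i) pairs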

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of the old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented in this spirit: They allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of the hit rate. Within the context of such a model, the variance may be estimated simultaneously along with overall sensitivity, participant effects, and item effects.
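One way such an extension might be written, as a sketch rather than a committed specification (here σ_o denotes the free old-item standard deviation, an assumed parameterization):

y_i^{(h)} | h_i ~ind Binomial(S_i, Φ(h_i / σ_o)),

with the model for y_i^{(f)} unchanged; setting σ_o = 1 recovers the equal-variance model.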

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on the variance is 1/σ².
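To make the binomial example concrete: with a Beta(a, b) prior on p and y successes in N trials, the posterior is Beta(a + y, b + N − y), with posterior mean (a + y)/(a + b + N); setting a = b = 0 yields the sample proportion y/N.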

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions; this, in turn, allows for rapid sampling of conditionals in the Gibbs sampler when estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis-Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times: The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need to assess the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported into the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either: In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when it happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the choice of a prior and the checking of the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.


This construction of wj is shown in Figure A1a The parameters of interest are z and w (w1 wJ) thedata are (x1 xJ) Note that if the latent variables w were known the problem of estimating z would besimple Parameter z is simply the population mean of w and could be estimated accordingly The fact that wis latent will not prove to be a stumbling block

The goal is to derive the marginal posterior of z To do so conditional posterior distributions are derivedand then integrated using Gibbs sampling The desired conditionals are z | w and wj | z xj j 1 J Westart with the former Parameter z is the population mean of w hence this case is that of estimating a popu-lation mean from normally distributed observations (w) The prior on z is a normal with parameters μ0 andσ 0

2 so the previous derivation is applicable A direct application of Equations 23 and 24 yields that the con-ditional posterior z | w is a normal with parameters

(A2)

and

(A3)

The conditional wj | z xj is only slightly more complex If xj is unknown latent variable wj is normally dis-tributed and centered at z When wj is conditioned on xj the sign of wj is known If xj 0 then by Equation A1wj must be negative Hence the conditional wj | z xj is a normal density that is truncated above at 0 (Fig-ure A1b) Likewise if xj 1 then wj must be positive and the conditional is a normal truncated below at 0(Figure A1c) The conditional posterior therefore is a truncated normal truncated either above or below de-pending on whether or not the trial was answered successfully

Once conditionals are specified analysis proceeds with Gibbs sampling Sampling from a normal isstraightforward sampling from a truncated normal is not too difficult and an algorithm for such sampling iscoded in Appendix C

σσ

primeminus

= +⎛

⎝⎜

⎠⎟

2

02

1

1N

prime = +⎛

⎝⎜

⎠⎟ +

⎝⎜

⎠⎟

minus

μσ

μσ

N Nw1

02

1

0

02

w zjind~ Normal 1( )

xw

wj

j

j

=ge

lt

⎧⎨⎪

⎩⎪

1 0

0 0

Wickelgren W A (1968) Unidimensional strength theory and com-ponent analysis of noise in absolute and comparative judgmentsJournal of Mathematical Psychology 5 102-122

Wishart J (1928) A generalized product moment distribution in sam-ples from normal multivariate population Biometrika 20 32-52

Wollen K A amp Cox S D (1981) Sentence cueing and the effect ofbizarre imagery Journal of Experimental Psychology Human Learn-ing amp Memory 7 386-392

NOTES

1

where Γ is the generalized factorial function for example Γ(n) (n 1)

2 True probabilities for individuals were obtained by sampling a betadistribution with parameters a 35 and b 15

Be( )( ) ( )( )

a ba ba b

=+

Γ ΓΓ

(Continued on next page)

602 ROUDER AND LU

APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method as follows Letxij be the indicator for the ith person on the jth trial xij 1 if the trial is successful and 0 otherwise Latentvariable wij is constructed as

(B1)

Conditional posterior wij | zi xij is a truncated normal as was discussed in Appendix A Conditional posteriorzi | w μ σ 2 is a normal with mean and variance given in Equations A2 and A3 respectively (with the substi-tution of μ for μ0 and σ 2 for σ 0

2) Conditional posterior densities of μ | σ 2 z and σ 2 | μ z are given in Equa-tions 25 and 28 respectively

xw

wij

ij

ij

=ge

lt

⎧⎨⎪

⎩⎪

1 0

0 0

APPENDIX A (Continued)

Figure A1 The Albert and Chib (1995) construction of a trial-specific latent vari-able for probit transforms (a) Construction of the unconditional latent variable wj forp 84 (b) Conditional distribution of wj on a failure on trial j In this case wj mustbe negative (c) Conditional distribution of wj on a success on trial j In this case wjmust be positive

NOTE TO APPENDIX A

A1 Let μw be the mean of w Then by construction

Random variable wj is a normal with variance of 10 The term Pr(wj 0) is its cumulative distribution function evaluatedat 0 Therefore

Combining this equality with the one above yields

and by symmetry of the normal

μw p z= =minusΦ 1( )

Φ minus( ) = minusμw p1

Pr wj w wlt( ) = minus( )⎡⎣ ⎤⎦ = minus( )0 0 1Φ Φμ μ

Pr Pr w x pj jlt( ) = =( ) = minus0 0 1

HIERARCHICAL BAYESIAN MODELS 603

APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individualsrsquo probabilities R is afreely available easy-to-install open-source statistical package based on SPlus It runs on Windows Macin-tosh and UNIX platforms and can be downloaded from wwwR-projectorg

functions to sample from a truncated normal-------------------------------------------generates n samples of a normal (b1) truncated below zerortnpos=function(nb)u=runif(n)b+qnorm(1-upnorm(b))generates n samples of a normal (b1) truncated above zerortnneg=function(nb)u=runif(n)b+qnorm(upnorm(-b))-----------------------------generate true values and data-----------------------------I=20 Number of participantsJ=50 Number of Trials per participantgenerate each individualrsquos true pp=rbeta(I3515)result use scan() to enter06932272 06956900 06330869 06504598 07008421 06589233 0733730506666958 05867875 07132572 07863823 06869082 06665865 0742691006649086 06212024 07375715 08025261 07587382 07363251generate each individualrsquos observed numbers of successesy=rbinom(IJp)result use scan() to enter34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37----------------------------------------Analysis Set UpM=10000 total number of MCMC iterationsmyrange=101M portion kept burn-in of 100 iterations discardedneeded vectors and matricesz=matrix(ncol=Inrow=M)matrix of each personrsquos z at each iterationnote z is probit of p eg z=qnorm(p)sigma2=1Mmu=1Mprior parameterssigma0=100 this parameter is a variance sigma_0^2mu0=0a=1b=1initial valuessigma2[1]=1mu[1]=1z[1]=rep(1I)shape for inverse gamma calculated outside of loop for speedshape=(I)2+a-----------------------------Main MCMC loop-----------------------------for (m in 2M) iterationfor (i in 1I) participantw=c(rtnpos(y[i]z[m-1i])rtnneg(J-y[i]z[m-1i])) samples of w|yzprec=J+1sigma2[m-1]meanz=(sum(w)+mu[m-1]sigma2[m-1])prec

604 ROUDER AND LU

APPENDIX C (Continued)

z[mi]=rnorm(1meanzsqrt(1prec)) sample of z|wmusigma2prec=Isigma2[m-1]+1sigma0meanmu=(sum(z[m])sigma2[m-1]+mu0sigma0)precmu[m]=rnorm(1meanmusqrt(1prec)) sample of mu|zrate=sum((z[m]-mu[m])^2)2+bsigma2[m]=1rgamma(1shape=shapescale=1rate) sample of sigma2|z-----------------------------------------Gather estimates of p for each individualallpest=pnorm(z[myrange]) MCMC chains of ppesth=apply(allpest2mean) posterior means

(Manuscript received May 25 2004revision accepted for publication January 12 2005)

Page 19: Rouder & Lu (2005)

HIERARCHICAL BAYESIAN MODELS 591

Analysis takes place in two steps. The first is to obtain posterior samples from h_i and f_i for each participant. These are probit-transformed parameters that may be estimated following the exact procedure in the previous section. Separate runs are made for signal trials (producing estimates of h_i) and for noise trials (producing estimates of f_i). These posterior samples are denoted [h_i] and [f_i], respectively. The second step is to use these posterior samples to estimate posterior distributions of individual d′_i and c_i. This is easily accomplished. For all samples m,

[d′_i]_m = [h_i]_m − [f_i]_m    (40)

and

[c_i]_m = −[f_i]_m.    (41)

Hypothetical data and parameter estimates are shown in Table 3. Individual variability in the data was implemented by sampling each participant's d′ and c from a lognormal and from a normal distribution, respectively. True d′ values are shown, and they range from 1.23 to 2.34. From these true values of d′ and c, true individual hit and false alarm probabilities were computed, and individual hit and false alarm frequencies were sampled from a binomial. Hit and false alarm frequencies were produced in this manner for 20 hypothetical individuals, each observing 50 signal and 50 noise trials.

A conventional sensitivity estimate for each individual was obtained using the formula

Φ^{-1}(Hit rate) − Φ^{-1}(False alarm rate).

The resulting estimates are shown in the last column of Table 3, labeled Conventional. The hierarchical signal detection model was analyzed with diffuse prior parameters a_h = a_f = b_h = b_f = .1, u_h = u_f = 0, and v_h = v_f = 1000. The chain was run for 10,000 iterations, with the first 500 serving as burn-in. This choice of burn-in period is very conservative. Autocorrelation for sensitivity for 2 select participants is displayed in the bottom panels of Figure 17; this degree of autocorrelation was typical for all parameters. Autocorrelation was about the same as in the previous application, and the resulting estimates are shown in the next-to-last column of Table 3.

The top left panel of Figure 17 shows a plot of estimated d′ as a function of true d′. Estimates from the hierarchical model are closer to the diagonal than are the conventional estimates, indicating more accurate estimation. Table 3 also provides efficiency information: the hierarchical model is 33% more efficient than the conventional estimates (this amount is typical of repeated runs of the same simulation). The reason for this gain is evident in the top right panel of Figure 17. The extreme estimates in the conventional method, which most likely reflect sample noise, are shrunk to more moderate estimates in the hierarchical model.
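Because Equations 40 and 41 operate on posterior samples, the conversion is a pair of elementwise operations. A minimal sketch in R, assuming h.post and f.post are hypothetical M x I matrices of posterior samples of h_i and f_i from the separate signal and noise runs:

## Sketch: converting posterior samples of (h, f) to (d', c)
dprime.post=h.post-f.post              # Equation 40, applied samplewise
c.post=-f.post                         # Equation 41
dprime.hat=apply(dprime.post,2,mean)   # posterior mean of d' per participant
c.hat=apply(c.post,2,mean)             # posterior mean of c per participant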


Figure 16. Modeling the probability of success across several participants. (Top left) Probability estimates as functions of true probability; the triangles represent estimator p_0 and the circles represent posterior means of p_i from the hierarchical model. (Top right) Hierarchical estimates as a function of p_0. (Bottom) Autocorrelation functions of the posterior probability estimates for 2 participants from the hierarchical model. The degree of autocorrelation is typical of the other participants.


SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency. Conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses in which scores are aggregated over items result in inconsistent estimates characterized by asymptotic bias. In this section, we show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme, blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension
It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was old or new. Let p_ij^(h) and p_ij^(f) be true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ^{-1}(p_ij^(h)), f_ij = Φ^{-1}(p_ij^(f))]. The model is given by independent Bernoulli trials:

y_ij^(h) ~ Bernoulli[Φ(h_ij)]    (42)

and

y_ij^(f) ~ Bernoulli[Φ(f_ij)].    (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j    (44)

and

f_ij = μ_f + β_i + δ_j.    (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f    (46)

and

c = −μ_f.    (47)
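To make Equations 42–45 concrete, the following R sketch simulates hits and false alarms from the model. All parameter settings here are illustrative assumptions, not the values used in the article's simulations.

## Sketch: simulating data from Equations 42-45 (illustrative values)
I=20; J=50                                 # participants and items
mu.h=1; mu.f=-1                            # grand means (assumed)
alpha=rnorm(I,0,.3); beta=rnorm(I,0,.3)    # participant effects
gamma=rnorm(J,0,.3); delta=rnorm(J,0,.3)   # item effects
h=mu.h+outer(alpha,gamma,"+")              # h_ij = mu_h + alpha_i + gamma_j
f=mu.f+outer(beta,delta,"+")               # f_ij = mu_f + beta_i + delta_j
y.h=matrix(rbinom(I*J,1,pnorm(h)),I,J)     # Equation 42
y.f=matrix(rbinom(I*J,1,pnorm(f)),I,J)     # Equation 43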

Table 3
Data and Estimates of Sensitivity

Participant  True d′  Hits  False Alarms  Hierarchical  Conventional
 1            1.238    35        7           1.608         1.605
 2            1.239    38       17           1.307         1.119
 3            1.325    38       11           1.554         1.478
 4            1.330    33       16           1.146          .880
 5            1.342    40       19           1.324         1.147
 6            1.405    39        8           1.730         1.767
 7            1.424    37       12           1.466         1.350
 8            1.460    27        5           1.417         1.382
 9            1.559    35        9           1.507         1.440
10            1.584    37        9           1.599         1.559
11            1.591    33        8           1.486         1.407
12            1.624    40        7           1.817         1.922
13            1.634    35       11           1.427         1.297
14            1.782    43       13           1.687         1.724
15            1.894    33        3           1.749         1.967
16            1.962    45       12           1.839         1.988
17            1.967    45        4           2.235         2.687
18            2.124    39        6           1.826         1.947
19            2.270    45        5           2.172         2.563
20            2.337    46        6           2.186         2.580
RMSE                                          .183          .275

Note—Data are numbers of signal responses in 50 trials.
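The RMSE row of Table 3 summarizes accuracy. A minimal sketch of the computation, assuming true.d, est.hier, and est.conv are hypothetical vectors holding the table's true and estimated sensitivities:

## Sketch: root-mean-squared error of each estimator (hypothetical vectors)
rmse=function(est,truth) sqrt(mean((est-truth)^2))
rmse(est.hier,true.d)   # about .183, as in Table 3
rmse(est.conv,true.d)   # about .275, as in Table 3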


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = (μ_h + α_i) − (μ_f + β_i)    (48)

and

c_i = −(μ_f + β_i).    (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = (μ_h + γ_j) − (μ_f + δ_j)    (50)

and

c_j = −(μ_f + δ_j).    (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k^2, where k indexes hits and false alarms. In the implementation, we set μ_0k = 0 and σ_0k^2 = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are independent normals centered around 0:

α_i ~ Normal(0, σ_α^2),    (52)

β_i ~ Normal(0, σ_β^2),    (53)

γ_j ~ Normal(0, σ_γ^2),    (54)

and

δ_j ~ Normal(0, σ_δ^2).    (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .1, b = .1). The model was evaluated with the same simulation from the introduction (Equations 8–11). Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle: item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in model analysis; participant and item effects on hits and false alarms are estimated independently.

Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.


Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis. If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple 50 in the example above is not completely arbitrary, because it should reflect the amount of autocorrelation. Chains with larger amounts of autocorrelation should implicate larger multiples for thinning.
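As a concrete illustration, the following sketch thins a chain in R. It assumes mu is a vector of Gibbs samples, as produced by the sampler in Appendix C.

## Sketch: thinning a chain by keeping every 50th iterate after burn-in
keep=seq(101,length(mu),by=50)   # drop 100 burn-in iterations, thin by 50
mu.thin=mu[keep]
acf(mu)        # autocorrelation of the raw chain
acf(mu.thin)   # attenuated autocorrelation after thinning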

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_{m−1} influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_{m−1} influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α^2, σ_γ^2, y) ∝ f(y | θ_h) × f(θ_h | σ_α^2, σ_γ^2).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α^2, σ_γ^2) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α^2, σ_γ^2). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β^2, σ_δ^2) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
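The Cholesky approach mentioned above is easy to sketch in R. If L is the lower-triangular Cholesky factor of covariance matrix S and z is a vector of independent standard normals, then m + Lz has mean m and covariance S. The numerical values below are illustrative.

## Sketch: multivariate normal sampling via the Cholesky factor
rmvn=function(m,S) as.vector(m+t(chol(S))%*%rnorm(length(m)))
S=matrix(c(1,.5,.5,1),2,2)   # an illustrative covariance matrix
rmvn(c(0,0),S)               # one draw from Normal(0, S)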

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18). Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of 27% in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.


The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant. The y-axis value of the point is α_i, the participant's hit rate effect; the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects reflected the mirror effect, as before.


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.

Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms. Others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.
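In an MCMC implementation of such a prior, the covariance matrices would be updated from their conditional posteriors, which under a Wishart-family prior are again of Wishart form. The sketch below shows one conventional way to draw a covariance matrix from an inverse-Wishart distribution in R; the degrees of freedom and scale matrix are illustrative assumptions, not values from the article.

## Sketch: drawing a 2 x 2 covariance matrix from an inverse-Wishart
df=3; S0=diag(2)                  # illustrative hyperparameters
W=rWishart(1,df,solve(S0))[,,1]   # Wishart draw with scale S0^{-1}
Sigma=solve(W)                    # inverse-Wishart draw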

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.
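One way to write such an extension, offered here as a sketch rather than as the article's model, is to let the old-item familiarity distribution have standard deviation σ_o as a free parameter, so that only the probit transform of the hit rate is scaled:

p_ij^(h) = Φ[(μ_h + α_i + γ_j)/σ_o],   p_ij^(f) = Φ(μ_f + β_i + δ_j).

Setting σ_o = 1 recovers the equal-variance model of Equations 42–45, and σ_o could be estimated in the same MCMC run as the other parameters.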

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is from the normal distribution: The posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on variance is 1/σ^2.



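The binomial example can be made explicit. With y successes in N trials and a Beta(a, b) prior, the posterior is

p | y ~ Beta(a + y, b + N − y),

with posterior mean (a + y)/(a + b + N). As a and b approach 0, this mean approaches the sample proportion y/N, the frequentist estimate.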

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior. Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.
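A prior-sensitivity check of this kind is easy to script. The sketch below assumes a hypothetical wrapper, run.mcmc, around the Appendix C sampler that takes the prior variance sigma0 and returns the posterior mean of mu; it illustrates the practice rather than reproducing code from the article.

## Sketch: rerun the sampler under several priors and compare estimates
## run.mcmc is a hypothetical wrapper around the Appendix C code
post.means=sapply(c(10,100,1000),function(s0) run.mcmc(sigma0=s0))
post.means   # inference is trustworthy if these differ only marginally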

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent. The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior. When this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different. We adopt the Bayesian approach out of practicality: Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.

Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.

Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.

Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.

Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.

Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.

Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.

Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.

Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.

Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.

Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.

Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.

Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.

Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.

Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.

Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.

Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.

Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.

Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.

Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.

Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.

Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.

Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.

Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.

Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.

Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.

Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.

Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.

Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.

Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.

Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.

McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.

Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.

Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.

Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.

Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.

Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.

Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.

Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.

Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.

Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.

Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.

Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.

Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.

Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.

Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.

Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.

Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.

Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0, and x_j = 0 if w_j < 0.    (A1)

Latent variables w are never known; all that is known is their sign: If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs be z, where z = Φ^{-1}(p).^{A1} Therefore, independently across trials,

w_j ~ Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0^2, so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ_0^2)^{-1} (N w̄ + μ_0/σ_0^2)    (A2)

and

σ′^2 = (N + 1/σ_0^2)^{-1}.    (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.

Wickelgren W A (1968) Unidimensional strength theory and com-ponent analysis of noise in absolute and comparative judgmentsJournal of Mathematical Psychology 5 102-122

Wishart J (1928) A generalized product moment distribution in sam-ples from normal multivariate population Biometrika 20 32-52

Wollen K A amp Cox S D (1981) Sentence cueing and the effect ofbizarre imagery Journal of Experimental Psychology Human Learn-ing amp Memory 7 386-392

NOTES

1

where Γ is the generalized factorial function for example Γ(n) (n 1)

2 True probabilities for individuals were obtained by sampling a betadistribution with parameters a 35 and b 15

Be( )( ) ( )( )

a ba ba b

=+

Γ ΓΓ

(Continued on next page)

602 ROUDER AND LU

APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method as follows Letxij be the indicator for the ith person on the jth trial xij 1 if the trial is successful and 0 otherwise Latentvariable wij is constructed as

(B1)

Conditional posterior wij | zi xij is a truncated normal as was discussed in Appendix A Conditional posteriorzi | w μ σ 2 is a normal with mean and variance given in Equations A2 and A3 respectively (with the substi-tution of μ for μ0 and σ 2 for σ 0

2) Conditional posterior densities of μ | σ 2 z and σ 2 | μ z are given in Equa-tions 25 and 28 respectively

xw

wij

ij

ij

=ge

lt

⎧⎨⎪

⎩⎪

1 0

0 0

APPENDIX A (Continued)

Figure A1 The Albert and Chib (1995) construction of a trial-specific latent vari-able for probit transforms (a) Construction of the unconditional latent variable wj forp 84 (b) Conditional distribution of wj on a failure on trial j In this case wj mustbe negative (c) Conditional distribution of wj on a success on trial j In this case wjmust be positive

NOTE TO APPENDIX A

A1 Let μw be the mean of w Then by construction

Random variable wj is a normal with variance of 10 The term Pr(wj 0) is its cumulative distribution function evaluatedat 0 Therefore

Combining this equality with the one above yields

and by symmetry of the normal

μw p z= =minusΦ 1( )

Φ minus( ) = minusμw p1

Pr wj w wlt( ) = minus( )⎡⎣ ⎤⎦ = minus( )0 0 1Φ Φμ μ

Pr Pr w x pj jlt( ) = =( ) = minus0 0 1

HIERARCHICAL BAYESIAN MODELS 603

APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individualsrsquo probabilities R is afreely available easy-to-install open-source statistical package based on SPlus It runs on Windows Macin-tosh and UNIX platforms and can be downloaded from wwwR-projectorg

functions to sample from a truncated normal-------------------------------------------generates n samples of a normal (b1) truncated below zerortnpos=function(nb)u=runif(n)b+qnorm(1-upnorm(b))generates n samples of a normal (b1) truncated above zerortnneg=function(nb)u=runif(n)b+qnorm(upnorm(-b))-----------------------------generate true values and data-----------------------------I=20 Number of participantsJ=50 Number of Trials per participantgenerate each individualrsquos true pp=rbeta(I3515)result use scan() to enter06932272 06956900 06330869 06504598 07008421 06589233 0733730506666958 05867875 07132572 07863823 06869082 06665865 0742691006649086 06212024 07375715 08025261 07587382 07363251generate each individualrsquos observed numbers of successesy=rbinom(IJp)result use scan() to enter34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37----------------------------------------Analysis Set UpM=10000 total number of MCMC iterationsmyrange=101M portion kept burn-in of 100 iterations discardedneeded vectors and matricesz=matrix(ncol=Inrow=M)matrix of each personrsquos z at each iterationnote z is probit of p eg z=qnorm(p)sigma2=1Mmu=1Mprior parameterssigma0=100 this parameter is a variance sigma_0^2mu0=0a=1b=1initial valuessigma2[1]=1mu[1]=1z[1]=rep(1I)shape for inverse gamma calculated outside of loop for speedshape=(I)2+a-----------------------------Main MCMC loop-----------------------------for (m in 2M) iterationfor (i in 1I) participantw=c(rtnpos(y[i]z[m-1i])rtnneg(J-y[i]z[m-1i])) samples of w|yzprec=J+1sigma2[m-1]meanz=(sum(w)+mu[m-1]sigma2[m-1])prec

604 ROUDER AND LU

APPENDIX C (Continued)

z[mi]=rnorm(1meanzsqrt(1prec)) sample of z|wmusigma2prec=Isigma2[m-1]+1sigma0meanmu=(sum(z[m])sigma2[m-1]+mu0sigma0)precmu[m]=rnorm(1meanmusqrt(1prec)) sample of mu|zrate=sum((z[m]-mu[m])^2)2+bsigma2[m]=1rgamma(1shape=shapescale=1rate) sample of sigma2|z-----------------------------------------Gather estimates of p for each individualallpest=pnorm(z[myrange]) MCMC chains of ppesth=apply(allpest2mean) posterior means

(Manuscript received May 25 2004revision accepted for publication January 12 2005)


reflect sample noise are shrunk to more moderate estimates in the hierarchical model.

SIGNAL DETECTION WITH ITEM VARIABILITY

In the introduction, we stressed that Bayesian hierarchical modeling may provide a solution for the statistical problems associated with unmodeled variability in nonlinear contexts. In all of the previous examples, the gains from hierarchical modeling were in better efficiency; conventional analysis, although not as accurate as Bayesian analysis, certainly yielded consistent, if not unbiased, estimates. In this section, item variability is added to signal detection. Conventional analyses, in which scores are aggregated over items, result in inconsistent estimates characterized by asymptotic bias. We show that, in contrast to conventional analysis, Bayesian hierarchical models provide consistent and accurate estimation.

Our route to the model is a bit circuitous. Our first attempt, based on a straightforward extension of the previous techniques, fails because of an excessive amount of autocorrelation. The solution to this failure is to use a powerful and recently developed sampling scheme: blocked sampling (Roberts & Sahu, 1997). First, we will proceed with the straightforward but problematic extension. Then we will introduce blocked sampling and show how it leads to tractable analysis.

Straightforward Extension

It is straightforward to specify models that include item effects. The previous notation is here expanded by indexing hits and false alarms by both participants and items. Let y_ij^(h) and y_ij^(f) denote the response of the ith person to the jth item. Values of y_ij^(h) and y_ij^(f) are dichotomous, indicating whether the response was old or new. Let p_ij^(h) and p_ij^(f) be the true probabilities that y_ij^(h) = 1 and y_ij^(f) = 1, respectively; that is, these probabilities are the true hit and false alarm rates for a given participant-by-item combination. Let h_ij and f_ij be the probit transforms [h_ij = Φ⁻¹(p_ij^(h)), f_ij = Φ⁻¹(p_ij^(f))]. The model is given by

y_ij^(h) ~ind Bernoulli[Φ(h_ij)] (42)

and

y_ij^(f) ~ind Bernoulli[Φ(f_ij)]. (43)

Linear models are placed on h_ij and f_ij:

h_ij = μ_h + α_i + γ_j (44)

and

f_ij = μ_f + β_i + δ_j. (45)

Parameters μ_h and μ_f correspond to the grand means of the hit and false alarm rates, respectively. Parameters α and β denote participant effects on hits and false alarms, respectively; parameters γ and δ denote item effects on hits and false alarms, respectively. These participant and item effects are assumed to be zero-centered random effects. Grand mean signal detection parameters, denoted d′ and c, are given by

d′ = μ_h − μ_f (46)

and

c = −μ_f. (47)
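To make the generative side of Equations 42–45 concrete, the following R sketch (in the style of Appendix C) simulates one data set. The grand means and the variance of the random effects are illustrative assumptions on our part, not the settings of the simulations reported below.

# illustrative sketch: generate old- and new-item data under Equations 42-45
# the numerical settings below are assumptions for illustration only
I=20; J=50                                 # participants and items
mu.h=1; mu.f=-1                            # grand means on the probit scale
alpha=rnorm(I,0,.3); beta=rnorm(I,0,.3)    # participant effects
gamma=rnorm(J,0,.3); delta=rnorm(J,0,.3)   # item effects
h=mu.h+outer(alpha,gamma,"+")              # h_ij = mu_h + alpha_i + gamma_j (Equation 44)
f=mu.f+outer(beta,delta,"+")               # f_ij = mu_f + beta_i + delta_j (Equation 45)
y.h=matrix(rbinom(I*J,1,pnorm(h)),I,J)     # y_ij^(h) ~ Bernoulli[Phi(h_ij)] (Equation 42)
y.f=matrix(rbinom(I*J,1,pnorm(f)),I,J)     # y_ij^(f) ~ Bernoulli[Phi(f_ij)] (Equation 43)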

Table 3
Data and Estimates of Sensitivity

Participant   True d′   Hits   False Alarms   Hierarchical   Conventional
 1            1.238     35      7             1.608          1.605
 2            1.239     38     17             1.307          1.119
 3            1.325     38     11             1.554          1.478
 4            1.330     33     16             1.146           .880
 5            1.342     40     19             1.324          1.147
 6            1.405     39      8             1.730          1.767
 7            1.424     37     12             1.466          1.350
 8            1.460     27      5             1.417          1.382
 9            1.559     35      9             1.507          1.440
10            1.584     37      9             1.599          1.559
11            1.591     33      8             1.486          1.407
12            1.624     40      7             1.817          1.922
13            1.634     35     11             1.427          1.297
14            1.782     43     13             1.687          1.724
15            1.894     33      3             1.749          1.967
16            1.962     45     12             1.839          1.988
17            1.967     45      4             2.235          2.687
18            2.124     39      6             1.826          1.947
19            2.270     45      5             2.172          2.563
20            2.337     46      6             2.186          2.580
RMSE                                           .183           .275

Note. Data are numbers of signal responses in 50 trials.
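The conventional estimates in Table 3 are probit differences of aggregated rates. For example, with 50 signal and 50 noise trials (see the table note), Participant 1's conventional value may be reproduced in R:

# conventional d': qnorm(hit rate) - qnorm(false alarm rate)
hits=35; fas=7; n=50                 # Participant 1 of Table 3
round(qnorm(hits/n)-qnorm(fas/n),3)  # yields 1.605, the Conventional column entry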


Participant-specific signal detection parameters, denoted d′_i and c_i, are given by

d′_i = (μ_h − μ_f) + (α_i − β_i) (48)

and

c_i = −(μ_f + β_i). (49)

Item-specific signal detection parameters, denoted d′_j and c_j, are given by

d′_j = (μ_h − μ_f) + (γ_j − δ_j) (50)

and

c_j = −(μ_f + δ_j). (51)

Priors on (μ_h, μ_f) are independent normals with mean μ_0k and variance σ_0k², where k indexes hits and false alarms. In the implementation, we set μ_0k = 0 and σ_0k² = 500, a sufficiently large number. The priors on (α, β, γ, δ) are hierarchical. The first stages (parent distributions) are normals centered around 0:

α_i ~ind Normal(0, σ_α²), (52)

β_i ~ind Normal(0, σ_β²), (53)

γ_j ~ind Normal(0, σ_γ²), (54)

and

δ_j ~ind Normal(0, σ_δ²). (55)

Second-stage priors on all four variances are inverse-gamma distributions. In application, the parameters of the inverse gamma were set to (a = .01, b = .01). The model was evaluated with the same simulation from the introduction (Equations 8–11): Twenty hypothetical participants observed 50 signal and 50 noise trials. Data in the simulation were generated in accordance with the mirror effect principle; item and participant increases in sensitivity corresponded to both increases in hit probabilities and decreases in false alarm probabilities. Although the data were generated with a mirror effect, this effect is not assumed in model analysis: Participant and item effects on hits and false alarms are estimated independently.

Figure 18. Autocorrelation of overall sensitivity (μ_h − μ_f) in a zero-centered random-effects model. (Left) Values of the chain. (Right) Autocorrelation function. This degree of autocorrelation is problematic.


Figure 17. Modeling signal detection across several participants. (Top left) Parameter estimates from the conventional approach and from the hierarchical model as functions of true sensitivity. (Top right) Hierarchical estimates as a function of the conventional ones. (Bottom) Autocorrelation functions for posterior samples of d′_i for 2 participants.



Figure 18 shows plots of the samples of overall sensitivity (d′). The chain has a much larger degree of autocorrelation than was previously observed, and this autocorrelation greatly complicates analysis: If a short chain is used, the resulting sample will not reflect the true posterior. Two strategies can mitigate autocorrelation. The first is to run long chains and thin them. For this application, the Gibbs sampling is sufficiently quick that the chain can be run for millions of cycles in the course of a day. With exceptionally long chains, convergence occurs. In such cases, researchers do not use all iterations but instead skip a fixed number, for example, discarding all but every 50th observation. The correlation between sample m and sample m + 50 is greatly attenuated. The choice of the multiple, 50 in the example above, is not completely arbitrary, because it should reflect the amount of autocorrelation: Chains with larger amounts of autocorrelation call for larger multiples for thinning.
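In R, thinning amounts to subsampling the stored chain. A minimal sketch, assuming a chain mu of length M as in Appendix C and the thinning multiple of 50 discussed above:

# keep every 50th iterate after a burn-in of 100; acf() displays the
# autocorrelation function of the thinned chain
keep=seq(from=101,to=M,by=50)
mu.thin=mu[keep]
acf(mu.thin)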

Although the brute-force method of running long chains and thinning is practical in this context, it may be impractical in others. For instance, we have encountered a greater degree of autocorrelation in Weibull hierarchical models (Lu, 2004; Rouder et al., 2005); in fact, we observed correlations spanning tens of thousands of cycles. In the Weibull application, sampling is far slower, and running chains of hundreds of millions of replicates is simply not feasible. Thus, we present blocked sampling (Roberts & Sahu, 1997) as a means of mitigating autocorrelation in Gibbs sampling.

Blocked Sampling

Up to now, the Gibbs sampling we have presented may be termed componentwise: For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μ_h is determined largely by the sums Σ_i α_i and Σ_j γ_j. Likewise, the sample of each α_i is determined largely by the value of μ_h. On iteration m, the value of [μ_h]_(m−1) influences each value of [α_i]_m, and the sum of these [α_i]_m values has a large influence on [μ_h]_m. By transitivity, the value of [μ_h]_(m−1) influences the value of [μ_h]_m; that is, they are correlated. This type of correlation also holds for μ_f and, subsequently, for the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step rather than componentwise. For the model on old-item trials (hits), the parameters (μ_h, α, γ) are sampled together from a multivariate distribution. Let θ_h denote the vector of parameters, θ_h = (μ_h, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

f(θ_h | σ_α², σ_γ², y) ∝ f(y | θ_h, σ_α², σ_γ²) × f(θ_h | σ_α², σ_γ²).

Each component of θ_h has a normal prior; hence, the prior f(θ_h | σ_α², σ_γ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σ_α², σ_γ²); these remain the same as before: They are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μ_f, β, δ, σ_β², σ_δ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
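As a sketch of the Cholesky alternative just mentioned, the following R function draws one sample from a multivariate normal with mean vector m and covariance matrix V; the function name is ours, for illustration, and is not part of the posted code.

# one multivariate normal draw via the Cholesky factorization V = L L'
# (cf. Ashby, 1992); m and V stand for a conditional posterior mean and covariance
rmvn=function(m,V){
  L=t(chol(V))                       # chol() returns the upper triangle; transpose it
  as.vector(m+L%*%rnorm(length(m)))  # shift and scale a vector of standard normals
}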

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18); successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias; they are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors; we discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement



on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles: Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant; the y-axis value of the point is α_i, the participant's hit rate effect, and the x-axis value is β_i, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γ_j vs. δ_j). As was expected, items with higher hit rates have lower false alarm rates.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects


Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.


Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.


reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity: Some items were assumed to be familiar, which resulted in increased hits and false alarms; others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γ_j = δ_j). Overall sensitivity results of the simulation are shown in Figure 21.

These are highly similar to those of the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a


Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.


Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.


baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors

The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

(α_i, β_i)′ ~ind Bivariate Normal(0, Σ^(i))

and

(γ_j, δ_j)′ ~ind Bivariate Normal(0, Σ^(j)).

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model; he provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
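As an illustration of how such a prior might be sampled (the hyperparameters below are arbitrary choices for the sketch, not values used in this article), one can draw a covariance matrix whose inverse is Wishart distributed and then draw correlated participant effects from it:

# draw Sigma with a Wishart-distributed inverse, then correlated (alpha_i, beta_i) pairs
df=3; S=diag(2)                           # illustrative degrees of freedom and scale
Sigma=solve(rWishart(1,df,solve(S))[,,1]) # inverse-Wishart draw for the covariance
L=t(chol(Sigma))
ab=t(L%*%matrix(rnorm(2*20),2,20))        # rows are 20 participants' (alpha_i, beta_i)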

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates; the effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model

In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit: They allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter; this may be accomplished by adding a variance parameter to the probit transform of the hit rate. Within the context of such a model, the variance may then be simultaneously estimated along with overall sensitivity, participant effects, and item effects.
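One way such an extension might be written (a sketch on our part, not the authors' specification) is to scale the probit transform of the hit rate by a free old-item standard deviation σ_o:

y_ij^(h) ~ind Bernoulli[Φ((μ_h + α_i + γ_j)/σ_o)],

with σ_o = 1 recovering the equal-variance model of Equation 42.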

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions

Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity into the analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, the Bayesian estimate of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is



from the normal distribution: The posterior of the population mean is t distributed, with the appropriate degrees of freedom, when the prior on the mean is flat and the prior on the variance is 1/σ².
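To spell out the binomial example: With y successes in N trials and a Be(a, b) prior, the posterior is Be(y + a, N − y + b), which has mean

(y + a) / (N + a + b).

As a, b → 0, this mean approaches the sample proportion y/N, the frequentist estimate.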

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions; this, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors used here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior: Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that, in some applications, researchers have a reasonable expectation of the range of the data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information: the information that people are not arbitrarily dissimilar, in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research; on the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on the analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants; the estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times: The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior; if this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.

Model Selection

We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent: The value of an estimate for any participant depends, in part, on the values of the others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the


psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls

There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior; when this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that, in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issue of choosing a prior and that of checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests

Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider those advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. L., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from a normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = 1 if w_j ≥ 0, and x_j = 0 if w_j < 0. (A1)

Latent variables w are never known; all that is known is their sign. If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (see Note A1). Therefore,

w_j ~ind Normal(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

μ′ = (N + 1/σ_0²)⁻¹ (N w̄ + μ_0/σ_0²) (A2)

and

σ′² = (N + 1/σ_0²)⁻¹, (A3)

where N is the number of trials and w̄ is the mean of w.

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known: If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.


NOTES

1. The function Be(a, b) is given by

Be(a, b) = Γ(a)Γ(b) / Γ(a + b),

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful, and 0 otherwise. Latent variable w_ij is constructed as

x_ij = 1 if w_ij ≥ 0, and x_ij = 0 if w_ij < 0. (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ² for σ_0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

Pr(w_j < 0) = Pr(x_j = 0) = 1 − p.

Random variable w_j is a normal with variance of 1.0; the term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

Pr(w_j < 0) = Φ[0 − μ_w] = Φ(−μ_w).

Combining this equality with the one above yields

Φ(−μ_w) = 1 − p,

and, by symmetry of the normal,

μ_w = Φ⁻¹(p) = z.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-PLUS. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal (b,1) truncated below zero
rtnpos=function(n,b) {u=runif(n); b+qnorm(1-u*pnorm(b))}
# generates n samples of a normal (b,1) truncated above zero
rtnneg=function(n,b) {u=runif(n); b+qnorm(u*pnorm(-b))}

# -----------------------------
# generate true values and data
# -----------------------------
I=20   # number of participants
J=50   # number of trials per participant
# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result, use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result, use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# analysis setup
# ---------------
M=10000        # total number of MCMC iterations
myrange=101:M  # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z=matrix(ncol=I,nrow=M)  # matrix of each person's z at each iteration
                         # note z is the probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
# prior parameters
sigma0=100  # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a

# --------------
# main MCMC loop
# --------------
for (m in 2:M)     # iteration
{
  for (i in 1:I)   # participant
  {
    w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i]))  # samples of w|y,z
    prec=J+1/sigma2[m-1]
    meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
    z[m,i]=rnorm(1,meanz,sqrt(1/prec))  # sample of z|w,mu,sigma2
  }
  prec=I/sigma2[m-1]+1/sigma0
  meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
  mu[m]=rnorm(1,meanmu,sqrt(1/prec))  # sample of mu|z
  rate=sum((z[m,]-mu[m])^2)/2+b
  sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate)  # sample of sigma2|z
}

# gather estimates of p for each individual
allpest=pnorm(z[myrange,])   # MCMC chains of p
pesth=apply(allpest,2,mean)  # posterior means
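As a usage note, the kept iterations support interval estimates as well as posterior means; for example, the following (hypothetical) follow-on lines compute 95% credible intervals for each individual's p:

# 95% credible intervals from the kept MCMC iterations
ci=apply(allpest,2,quantile,probs=c(.025,.975))
ci[,1]  # lower and upper limits for the first participant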

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)

Page 21: Rouder & Lu (2005)

HIERARCHICAL BAYESIAN MODELS 593

Participant-specific signal detection parameters de-noted d primei and ci are given by

d primei μh μf αi βi (48)

andci μf βi (49)

Item-specific signal detection parameters denoted d prime jand c j are given by

d prime j μh μf γj δj (50)

andc j μf δj (51)

Priors on (μh μf) are independent normals with meanμ0k and variance σ 2

0k where k indicates hits or falsealarms In the implementation we set μ0 k 0 andσ 2

0 k 500 a sufficiently large number The priors on(αα ββ γγ δδ ) are hierarchical The first stages (parent dis-tributions) are normals centered around 0

(52)

(53)

(54)

and

(55)

Second-stage priors on all four variances are inverse-gamma distributions In application the parameters ofthe inverse gamma were set to (a 01 b 01) Themodel was evaluated with the same simulation from theintroduction (Equations 8ndash11) Twenty hypothetical par-ticipants observed 50 signal and 50 noise trials Data inthe simulation were generated in accordance with themirror effect principle Item and participant increases insensitivity corresponded to both increases in hit proba-bilities and decreases in false alarm probabilities Al-though the data were generated with a mirror effect this

δ σδjind~ Normal 0 2 ( )

γ σγjind~ Normal 0 2 ( )

β σβiind~ Normal 0 2 ( )

α σαiind~ Normal 0 2 ( )

Figure 18 Autocorrelation of overall sensitivity (μμh μμ f ) in azero-centered random-effects model (Left) Values of the chain(Right) Autocorrelation function This degree of autocorrelationis problematic

ConventionalHierarchical

Participant 5 Participant 15

True Sensitivity Conventional Estimates

Lag0 10 20 30 40

Aut

ocor

rela

tion

Est

imat

es

Hie

rarc

hica

l Est

imat

es

Lag0 10 20 30 40

Aut

ocor

rela

tion

05 10 15 20 25 30 05 10 15 20 25 30

30

25

20

15

10

05

30

25

20

15

10

05

8

4

0

8

4

0

Figure 17 Modeling signal detection across several participants (Top left)Parameter estimates from the conventional approach and from the hierarchi-cal model as functions of true sensitivity (Top right) Hierarchical estimates asa function of the conventional ones (Bottom) Autocorrelation functions forposterior samples of d primeprimei for 2 participants

594 ROUDER AND LU

effect is not assumed in model analysis Participant anditem effects on hits and false alarms are estimated inde-pendently

Figure 18 shows plots of the samples of overall sensi-tivity (d prime ) The chain has a much larger degree of auto-correlation than was previously observed and this auto-correlation greatly complicates analysis If a short chainis used the resulting sample will not reflect the true pos-terior Two strategies can mitigate autocorrelation Thefirst is to run long chains and thin them For this appli-cation the Gibbs sampling is sufficiently quick that thechain can be run for millions of cycles in the course of aday With exceptionally long chains convergence oc-curs In such cases researchers do not use all iterationsbut instead skip a fixed numbermdashfor example discard-ing all but every 50th observation The correlation be-tween sample m and sample m 50 is greatly attenuatedThe choice of the multiple 50 in the example above is notcompletely arbitrary because it should reflect the amountof autocorrelation Chains with larger amounts of autocor-relation should implicate larger multiples for thinning

Although the brute-force method of running long chainsand thinning is practical in this context it may be im-practical in others For instance we have encountered agreater degree of autocorrelation in Weibull hierarchicalmodels (Lu 2004 Rouder et al 2005) in fact we ob-served correlations spanning tens of thousands of cyclesIn the Weibull application sampling is far slower andrunning chains of hundreds of millions of replicates issimply not feasible Thus we present blocked sampling(Roberts amp Sahu 1997) as a means of mitigating auto-correlation in Gibbs sampling

Blocked Sampling
Up to now, the Gibbs sampling we have presented may be termed componentwise. For each iteration, parameters have been updated one at a time and in turn. This type of sampling is advocated because it is straightforward to implement and sufficient for many applications. It is well known, however, that componentwise sampling leads to autocorrelation in models with zero-centered random effects.

The autocorrelation problem comes about when the conditional posterior distributions are highly determined by conditioned-upon parameters and not by the data. In the hierarchical signal detection model, the sample of μh is determined largely by the sum of Σiαi and Σjγj. Likewise, the sample of each αi is determined largely by the value of μh. On iteration m, the value of [μh]m−1 influences each value of [αi]m, and the sum of these [αi]m values has a large influence on [μh]m. By transitivity, the value of [μh]m−1 influences the value of [μh]m; that is, they are correlated. This type of correlation is also true of μf and, subsequently, of the overall estimates of sensitivity.

In blocked sampling, sets of parameters are drawn jointly in one step, rather than componentwise. For the model on old-item trials (hits), the parameters (μh, α, γ) are sampled together from a multivariate distribution. Let θh denote the vector of parameters, θh = (μh, α, γ). The derivation of conditional posterior distributions follows by multiplying the likelihood by the prior:

$f\left(\theta_h \mid \sigma_\alpha^2, \sigma_\gamma^2, \mathbf{y}\right) \propto f\left(\mathbf{y} \mid \theta_h, \sigma_\alpha^2, \sigma_\gamma^2\right) \times f\left(\theta_h \mid \sigma_\alpha^2, \sigma_\gamma^2\right).$

Each component of θh has a normal prior. Hence, the prior f(θh | σα², σγ²) is a multivariate normal. When the Albert and Chib (1995) alternative is used (Appendix B), the likelihood term becomes that of the latent data w, which is distributed as a multivariate normal. The resulting conditional posterior is also a multivariate normal, but one with nonzero covariance terms (Lu, 2004). The other needed conditional posteriors are those for the variance terms (σα², σγ²). These remain the same as before; they are distributed as inverse-gamma distributions. In order to run Gibbs sampling in the blocked format, it is necessary to be able to sample from a multivariate normal with an arbitrary covariance matrix, a feature that is built in to some statistical packages. An alternative is to use a Cholesky decomposition to sample a multivariate normal from a collection of univariate ones (Ashby, 1992). Conditional posteriors for new-item parameters (μf, β, δ, σβ², σδ²) are derived and sampled analogously. The R code for this blocked Gibbs sampler may be found at www.missouri.edu/~pcl.
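The Cholesky route can be sketched in a few lines of R; the covariance values below are illustrative, not those of the fitted model:

# Draw from a multivariate normal with arbitrary covariance by
# transforming independent univariate normals (cf. Ashby, 1992).
rmvn <- function(mean, Sigma) {
  R <- chol(Sigma)  # upper-triangular factor, so Sigma = t(R) %*% R
  as.numeric(mean + t(R) %*% rnorm(length(mean)))
}
Sigma <- matrix(c(1, .5, .5, 2), 2, 2)  # illustrative covariance matrix
draw <- rmvn(c(0, 0), Sigma)            # one joint (blocked) draw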

To test blocked sampling in this application, data were generated in the same manner as before (Equations 8–11; 20 participants observing 50 items, with item and participant increases in d′ resulting in increased hit rate probabilities and decreased false alarm rate probabilities). Figure 19 shows the results. The two top panels show dramatically improved convergence over componentwise sampling (cf. Figure 18). Successive samples of sensitivity are virtually uncorrelated. The bottom left panel shows parameter estimation from both the conventional method (aggregating hit and false alarm events across items) and the hierarchical model. The conventional method has a consistent downward bias of .27 in this sample. As was mentioned before, this bias is asymptotic and remains even in the limit of increasingly large numbers of items. The hierarchical estimates are much closer to the diagonal and have no detectable overall bias. The hierarchical estimates are 40% more efficient than their conventional counterparts. The bottom right panel shows a comparison of the hierarchical estimates with the conventional ones. It is clear that the hierarchical model produces higher valued estimates, especially for smaller true values of d′.

There is, however, a noticeable but small misfit of the hierarchical model: a tendency to overestimate sensitivity for participants with smaller true values and to underestimate it for participants with larger true values. This type of misfit may be termed over-shrinkage. We suspect over-shrinkage comes about from the choice of priors. We discuss potential modifications of the priors in the next section. The good news is that over-shrinkage decreases as the number of items is increased. All in all, the hierarchical model presented here is a dramatic improvement on the conventional practice of aggregating hits and false alarms across items.

The hierarchical model can also be used to explore the nature of participant and item effects. The data were generated by mirror effect principles. Increases in hit rates corresponded to decreases in false alarm rates. Although this principle was not assumed in analysis, the resulting estimates of participant and item effects should show a mirror effect. The left panel in Figure 20 shows the relationship between participant effects. Each point is from a different participant. The y-axis value of the point is αi, the participant's hit rate effect; the x-axis value is βi, the participant's false alarm rate effect. The negative correlation is expected: Participants with higher hit rates have lower false alarm rates. The right panel presents the same effects for items (γj vs. δj). As was expected, items with higher hit rates have lower false alarm rates.

Figure 19. Analysis of a signal detection paradigm with participant and item variability. The top row shows that blocked sampling mitigates autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True random effects were generated with a mirror effect.

Figure 20. Relationship between estimated random effects when true values were generated with a mirror effect. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects also reveal a negative correlation.

To demonstrate the model's ability to disentangle different types of item and participant effects, we performed a second simulation. In this case, participant effects reflected the mirror effect, as before. Item effects, however, reflected baseline familiarity. Some items were assumed to be familiar, which resulted in increased hits and false alarms. Others lacked familiarity, which resulted in decreased hits and false alarms. This baseline-familiarity effect was implemented by setting the effect of an item on hits equal to that on false alarms (γj = δj). Overall sensitivity results of the simulation are shown in Figure 21.

Figure 21. Analysis of the second simulation of a signal detection paradigm with participant and item variability. The top row shows little autocorrelation. The bottom row shows conventional and hierarchical estimates of participants' sensitivity as functions of the true sensitivity (left) and hierarchical estimates as a function of the conventional estimates (right). True participant random effects were generated with a mirror effect, and true item random effects were generated with a shift in baseline familiarity.

Figure 22. Relationship between estimated random effects when true participant effects were generated with a mirror effect and true item effects were generated with a shift in baseline familiarity. (Left) Estimated participant effects on hit rate as a function of their effects on false alarm rate reveal a negative correlation. (Right) Estimated item effects reveal a positive correlation.

These are highly similar to the previous simulation in that (1) there is little autocorrelation, (2) the hierarchical estimates are much better than their conventional counterparts, and (3) there is a tendency toward over-shrinkage. Figure 22 shows the relationships among the random effects. As expected, estimates of participant effects are negatively correlated, reflecting a mirror effect. Estimates of item effects are positively correlated, reflecting a baseline-familiarity effect. In sum, the hierarchical model offers detailed and relatively accurate analysis of participant and item effects in signal detection.

FUTURE DEVELOPMENT OF A SIGNAL DETECTION MODEL

Expanded Priors
The simulations reveal that there is some over-shrinkage in the hierarchical model (see Figures 19 and 21). We speculate that the reason this comes about is that the prior assumes independence between α and β and between γ and δ. This independence may be violated by negative correlation from mirror effects or by positive correlation from criterion effects or baseline-familiarity effects. The prior is neutral in that it does not favor either of these effects, but it is informative in that it is not concordant with either. A less informative alternative would not assume independence. We seek a prior in which mirror effects and baseline-familiarity effects, as well as a lack thereof, are all equally likely.

A prior that allows for correlations is given as follows:

$\begin{pmatrix} \alpha_i \\ \beta_i \end{pmatrix} \stackrel{\mathrm{ind}}{\sim} \mathrm{Bivariate\ Normal}\left(\mathbf{0}, \Sigma^{(i)}\right)$

and

$\begin{pmatrix} \gamma_j \\ \delta_j \end{pmatrix} \stackrel{\mathrm{ind}}{\sim} \mathrm{Bivariate\ Normal}\left(\mathbf{0}, \Sigma^{(j)}\right).$

The covariance matrices Σ^(i) and Σ^(j) allow for arbitrary correlations of participant and item effects, respectively, on hit and false alarm rates. The prior on the covariance matrix elements must be specified. A suitable choice is that the prior on the off-diagonal elements be diffuse and thus able to account for any degree and direction of correlation. One possible prior for covariance matrices is the Wishart distribution (Wishart, 1928), which is a conjugate form. Lu (2004) developed a model of correlated participant and item effects in a related application, Jacoby's (1991) process dissociation model. He provided conditional posteriors as well as an MCMC implementation. This approach looks promising, but significant development and analysis in this context are still needed.
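To convey what such a prior expresses, the following R sketch draws participant effects from a bivariate normal with a negative off-diagonal term; the correlation of −.7 is illustrative only:

# Correlated (alpha_i, beta_i) pairs: a negative off-diagonal element
# in Sigma builds a mirror effect into the prior.
Sigma.i <- matrix(c(1, -.7, -.7, 1), 2, 2)
R <- chol(Sigma.i)
effects <- t(R) %*% matrix(rnorm(2 * 20), nrow = 2)  # 20 participants
cor(effects[1, ], effects[2, ])  # near -.7: higher hit effects, lower FA effects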

Fortunately, the present independent-prior model is sufficient for asymptotically consistent analysis. The advantage of the correlated prior discussed above would be twofold. First, there may be some increase in the accuracy of sensitivity estimates. The effect would emerge for extreme participants, for whom there is a tendency in the independent-prior model to over-shrink estimates. The second benefit may be increased ability to assess mirror and baseline-familiarity effects. The present independent-prior model does not give sufficient credence to either of these effects; hence, its estimates of true correlations are biased toward 0. A correlated prior would lead to more power in detecting mirror and baseline-familiarity effects, should they be present.

Unequal Variance Signal Detection Model
In the present model, the variances of old- and new-item familiarity distributions are assumed to be equal. This assumption is fairly standard in many studies. Researchers who use signal detection in memory do so because it is a quick and principled psychometric model to measure sensitivity without a detailed commitment to the underlying processing. Our models are presented with this spirit; they allow researchers to assess sensitivity without undue influence from item variability.

There is some evidence that the equal-variance assumption is not appropriate. The experimental approach for testing the equal-variance assumption is to explicitly manipulate criteria and to trace out the receiver-operating characteristic (Macmillan & Creelman, 1991). Ratcliff, Sheu, and Gronlund (1992) followed this approach in a recognition memory experiment and found that the standard deviation of familiarity from the old-item distribution was .8 that of the new-item distribution. This result, taken at face value, provides motivation for future development. It seems feasible to construct a model in which the variance of the old-item familiarity distribution is a free parameter. This may be accomplished by adding a variance parameter to the probit transform of hit rate. Within the context of such a model, then, the variance may be simultaneously estimated along with overall sensitivity, participant effects, and item effects.
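A sketch of this extension in R (the parameterization and names are ours, not the fitted model's; sigma.old is the free old-item standard deviation):

# Under the unequal-variance model, new-item familiarity ~ Normal(0, 1)
# and old-item familiarity ~ Normal(d, sigma.old^2), with criterion crit.
phit <- function(d, crit, sigma.old) pnorm((d - crit) / sigma.old)
pfa  <- function(crit) pnorm(-crit)
phit(d = 1.5, crit = .75, sigma.old = 1.25)  # hit probability
pfa(crit = .75)                              # false alarm probability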

GENERAL DISCUSSION

This article presents an introduction to Bayesian hierarchical models. The main goals of this article have been to (1) highlight the dangers of aggregating data in nonlinear contexts, (2) demonstrate that Bayesian hierarchical models are a feasible approach to modeling participant and item variability simultaneously, and (3) implement a Bayesian hierarchical signal detection model for recognition memory paradigms. Our concluding discussion focuses on a few select issues in Bayesian modeling: subjectivity of prior distributions, model selection, pitfalls in Bayesian analysis, and the contribution of Bayesian techniques to the debate about the usefulness of significance tests.

Subjectivity of Prior Distributions
Bayesian analysis depends on prior distributions, so it is prudent to wonder whether these distributions add too much subjectivity to the analysis. We believe that priors can be chosen that enhance analysis without being overly subjective. In many situations, it is possible to choose priors that result in posteriors that exactly match frequentist sampling distributions. For example, Bayesian estimation of the probability of success in a binomial distribution is the proportion of successes if the prior is a beta with parameters a = b = 0. A second example is from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².
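The binomial example is easy to verify numerically: With a Beta(a, b) prior, the posterior mean of the success probability is (y + a)/(N + a + b), which tends to the sample proportion as a and b shrink toward 0 (the data values below are arbitrary):

y <- 34; N <- 50  # arbitrary data: 34 successes in 50 trials
post.mean <- function(a, b) (y + a) / (N + a + b)
post.mean(1, 1)        # uniform Beta(1, 1) prior: about .673
post.mean(.001, .001)  # near the limit a = b = 0: essentially y/N = .68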

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior. Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm, rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that in some applications researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information. This is information that people are not arbitrarily dissimilar; in other words, that true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors. The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.
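In practice, such a check can be as simple as looping the analysis over a few prior settings; run.model() below is a hypothetical stand-in for whatever fitting routine is in use:

# Hypothetical sensitivity check: refit under several prior variances
# and compare a posterior summary across runs.
for (sigma0 in c(10, 100, 1000)) {
  fit <- run.model(y, prior.var = sigma0)  # run.model() is hypothetical
  print(c(prior.var = sigma0, post.mean = mean(fit$mu)))
}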

Model Selection
We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent. The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.

Bayesian Pitfalls
There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done unwittingly from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior. When this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.

Bayesian Analysis and Significance Tests
Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different. We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools, such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let xj be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable wj is defined as follows:

$x_j = \begin{cases} 1, & w_j \ge 0, \\ 0, & w_j < 0. \end{cases}$  (A1)

Latent variables w are never known. All that is known is their sign: If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable wj is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs be z, where z = Φ⁻¹(p).^A1 Therefore,

$w_j \stackrel{\mathrm{ind}}{\sim} \mathrm{Normal}(z, 1).$

This construction of wj is shown in Figure A1a. The parameters of interest are z and w = (w1, …, wJ); the data are (x1, …, xJ). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and wj | z, xj, for j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ0 and σ0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

$\mu' = \left(N + \frac{1}{\sigma_0^2}\right)^{-1}\left(N\bar{w} + \frac{\mu_0}{\sigma_0^2}\right)$  (A2)

and

$\sigma'^2 = \left(N + \frac{1}{\sigma_0^2}\right)^{-1}.$  (A3)

The conditional wj | z, xj is only slightly more complex. If xj is unknown, latent variable wj is normally distributed and centered at z. When wj is conditioned on xj, the sign of wj is known. If xj = 0, then, by Equation A1, wj must be negative. Hence, the conditional wj | z, xj is a normal density that is truncated above at 0 (Figure A1b). Likewise, if xj = 1, then wj must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.


NOTES

1. $\mathrm{Be}(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$, where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let xij be the indicator for the ith person on the jth trial: xij = 1 if the trial is successful and 0 otherwise. Latent variable wij is constructed as

$x_{ij} = \begin{cases} 1, & w_{ij} \ge 0, \\ 0, & w_{ij} < 0. \end{cases}$  (B1)

Conditional posterior wij | zi, xij is a truncated normal, as was discussed in Appendix A. Conditional posterior zi | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ0 and σ² for σ0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.

APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable wj for p = .84. (b) Conditional distribution of wj on a failure on trial j; in this case, wj must be negative. (c) Conditional distribution of wj on a success on trial j; in this case, wj must be positive.

NOTE TO APPENDIX A

A1. Let μw be the mean of w. Then, by construction,

$\Pr\left(w_j < 0\right) = \Pr\left(x_j = 0\right) = 1 - p.$

Random variable wj is a normal with variance of 1.0. The term Pr(wj < 0) is its cumulative distribution function evaluated at 0. Therefore,

$\Pr\left(w_j < 0\right) = \Phi\left[0 - \mu_w\right] = \Phi\left(-\mu_w\right).$

Combining this equality with the one above yields

$\Phi\left(-\mu_w\right) = 1 - p,$

and, by symmetry of the normal,

$\mu_w = \Phi^{-1}(p) = z.$


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b,1) truncated below zero
rtnpos=function(n,b) {u=runif(n); b+qnorm(1-u*pnorm(b))}
# generates n samples of a normal(b,1) truncated above zero
rtnneg=function(n,b) {u=runif(n); b+qnorm(u*pnorm(-b))}
# -----------------------------
# generate true values and data
# -----------------------------
I=20  # Number of participants
J=50  # Number of Trials per participant
# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result, use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result, use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
# ----------------------------------------
# Analysis: Set Up
M=10000  # total number of MCMC iterations
myrange=101:M  # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z=matrix(ncol=I,nrow=M)  # matrix of each person's z at each iteration
# note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
# prior parameters
sigma0=100  # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a
# -----------------------------
# Main MCMC loop
# -----------------------------
for (m in 2:M)  # iteration
{
  for (i in 1:I)  # participant
  {
    w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i]))  # samples of w|y,z
    prec=J+1/sigma2[m-1]
    meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
    z[m,i]=rnorm(1,meanz,sqrt(1/prec))  # sample of z|w,mu,sigma2
  }
  prec=I/sigma2[m-1]+1/sigma0
  meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
  mu[m]=rnorm(1,meanmu,sqrt(1/prec))  # sample of mu|z
  rate=sum((z[m,]-mu[m])^2)/2+b
  sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate)  # sample of sigma2|z
}
# -----------------------------------------
# Gather estimates of p for each individual
allpest=pnorm(z[myrange,])  # MCMC chains of p
pesth=apply(allpest,2,mean)  # posterior means
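As a quick check of the sampler's output (not part of the original listing), the hierarchical estimates can be plotted against the observed proportions; the shrinkage toward the group mean is then visible:

# Compare observed proportions with hierarchical posterior means.
plot(y/J, pesth, xlab="Observed proportion", ylab="Hierarchical estimate")
abline(0, 1)  # points pulled toward the group mean fall off this line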

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)

Page 22: Rouder & Lu (2005)

594 ROUDER AND LU

effect is not assumed in model analysis Participant anditem effects on hits and false alarms are estimated inde-pendently

Figure 18 shows plots of the samples of overall sensi-tivity (d prime ) The chain has a much larger degree of auto-correlation than was previously observed and this auto-correlation greatly complicates analysis If a short chainis used the resulting sample will not reflect the true pos-terior Two strategies can mitigate autocorrelation Thefirst is to run long chains and thin them For this appli-cation the Gibbs sampling is sufficiently quick that thechain can be run for millions of cycles in the course of aday With exceptionally long chains convergence oc-curs In such cases researchers do not use all iterationsbut instead skip a fixed numbermdashfor example discard-ing all but every 50th observation The correlation be-tween sample m and sample m 50 is greatly attenuatedThe choice of the multiple 50 in the example above is notcompletely arbitrary because it should reflect the amountof autocorrelation Chains with larger amounts of autocor-relation should implicate larger multiples for thinning

Although the brute-force method of running long chainsand thinning is practical in this context it may be im-practical in others For instance we have encountered agreater degree of autocorrelation in Weibull hierarchicalmodels (Lu 2004 Rouder et al 2005) in fact we ob-served correlations spanning tens of thousands of cyclesIn the Weibull application sampling is far slower andrunning chains of hundreds of millions of replicates issimply not feasible Thus we present blocked sampling(Roberts amp Sahu 1997) as a means of mitigating auto-correlation in Gibbs sampling

Blocked SamplingUp to now the Gibbs sampling we have presented may

be termed componentwise For each iteration parametershave been updated one at a time and in turn This type ofsampling is advocated because it is straightforward toimplement and sufficient for many applications It iswell known however that componentwise sampling leadsto autocorrelation in models with zero-centered randomeffects

The autocorrelation problem comes about when theconditional posterior distributions are highly determinedby conditioned-upon parameters and not by the data Inthe hierarchical signal detection model the sample of μhis determined largely by the sum of Σ iαi and Σ jγj Like-wise the sample of each αi is determined largely by thevalues of μh On iteration m the value of [μh]m1 influ-ences each value of [αi]m and the sum of these [αi]m val-ues has a large influence on μh By transitivity the valueof [μh]m1 influences the value of [μh]m that is they arecorrelated This type of correlation is also true of μf andsubsequently for the overall estimates of sensitivity

In blocked sampling sets of parameters are drawnjointly in one step rather than componentwise For themodel on old-item trials (hits) the parameters (μh αα γγ )are sampled together from a multivariate distribution

Let θθh denote the vector of parameters θθh (μh αα γγ )The derivation of conditional posterior distributions fol-lows by multiplying the likelihood by the prior

Each component of θθ has a normal prior Hence theprior f (θθh | σα

2 σ γ2) is a multivariate normal When the

Albert and Chib (1995) alternative is used (Appendix B)the likelihood term becomes that of the latent data wwhich is distributed as a multivariate normal The re-sulting conditional posterior is also a multivariate nor-mal but one with nonzero covariance terms (Lu 2004)The other needed conditional posteriors are those for thevariance terms (σα

2 σγ2) These remain the same as be-

fore they are distributed as inverse-gamma distribu-tions In order to run Gibbs sampling in the blocked format it is necessary to be able to sample from a mul-tivariate normal with an arbitrary covariance matrix afeature that is built in to some statistical packages An al-ternative is to use a Cholesky decomposition to samplea multivariate normal from a collection of univariateones (Ashby 1992) Conditional posteriors for new-itemparameters (μf ββ δδ σ β

2 σδ2) are derived and sampled

analogously The R code for this blocked Gibbs samplermay be found at wwwmissouriedu~pcl

To test blocked sampling in this application data weregenerated in the same manner as before (Equations 8ndash1120 participants observing 50 items with item and partici-pant increases in d prime resulting in increased hit rate proba-bilities and decreased false alarm rate probabilities) Fig-ure 19 shows the results The two top panels showdramatically improved convergence over componentwisesampling (cf Figure 18) Successive samples of sensitiv-ity are virtually uncorrelated The bottom left panelshows parameter estimation from both the conventionalmethod (aggregating hit and false alarm events acrossitems) and the hierarchical model The conventionalmethod has a consistent downward bias of 27 in thissample As was mentioned before this bias is asymptoticand remains even in the limit of increasingly large num-bers of items The hierarchical estimates are much closerto the diagonal and have no detectable overall bias Thehierarchical estimates are 40 more efficient than theirconventional counterparts The bottom right panel showsa comparison of the hierarchical estimates with the con-ventional ones It is clear that the hierarchical model pro-duces higher valued estimates especially for smallertrue values of d prime

There is however a noticeable but small misfit of thehierarchical modelmdasha tendency to overestimate sensi-tivity for participants with smaller true values and tounderestimate it for participants with larger true valuesThis type of misfit may be termed over-shrinkage Wesuspect over-shrinkage comes about from the choice ofpriors We discuss potential modifications of the priorsin the next section The good news is that over-shrinkagedecreases as the number of items is increased All in allthe hierarchical model presented here is a dramatic im-

f f fθθ θθ θθh h h| | | σ σ σ σ σ σα γ α γ α γ2 2 2 2 2 2y y( ) ( ) times (( )

HIERARCHICAL BAYESIAN MODELS 595

provement on the conventional practice of aggregatinghits and false alarms across items

The hierarchical model can also be used to explore thenature of participant and item effects The data were gen-erated by mirror effect principles Increases in hit ratescorresponded to decreases in false alarm rates Althoughthis principle was not assumed in analysis the resultingestimates of participant and item effects should show amirror effect The left panel in Figure 20 shows the rela-tionship between participant effects Each point is from

a different participant The y-axis value of the point isαi the participantrsquos hit rate effect the x-axis value is βi the participantrsquos false alarm rate effect The negative cor-relation is expected Participants with higher hit rateshave lower false alarm rates The right panel presents thesame effects for items (γj vs δj) As was expected itemswith higher hit rates have lower false alarm rates

To demonstrate the modelrsquos ability to disentangle dif-ferent types of item and participant effects we performeda second simulation In this case participant effects re-

ConventionalHierarchical

True Sensitivity Conventional Estimates

Iteration0 5 10 20 30

Aut

ocor

rela

tion

Est

imat

es

Hie

rarc

hica

l Est

imat

es

Lag

0 10 20 30

Sen

sitiv

ity

0 400 800

0 10 20 30

20

15

10

05

0

30

20

10

0

30

20

10

0

8

4

0

Figure 19 Analysis of a signal detection paradigm with participantand item variability The top row shows that blocked sampling mitigatesautocorrelation The bottom row shows conventional and hierarchicalestimates of participantsrsquo sensitivity as functions of the true sensitivity(left) and hierarchical estimates as a function of the conventional esti-mates (right) True random effects were generated with a mirror effect

ndash6 ndash2 2

ndash

ndash15 ndash05 05 15Effect on False Alarms Effect on False Alarms

Effe

ct o

n H

its

Effe

ct o

n H

its

Participants Items

4

2

0

ndash2

10

0

ndash10

ndash20

Figure 20 Relationship between estimated random effects when true valueswere generated with a mirror effect (Left) Estimated participant effects on hitrate as a function of their effects on false alarm rate reveal a negative correla-tion (Right) Estimated item effects also reveal a negative correlation

596 ROUDER AND LU

flected the mirror effect as before Item effects howeverreflected baseline familiarity Some items were assumedto be familiar which resulted in increased hits and falsealarms Others lacked familiarity which resulted in de-creased hits and false alarms This baseline-familiarityeffect was implemented by setting the effect of an itemon hits equal to that on false alarms (γj δj) Overall sen-sitivity results of the simulation are shown in Figure 21

These are highly similar to the previous simulation in that(1) there is little autocorrelation (2) the hierarchical esti-mates are much better than their conventional counter-parts and (3) there is a tendency toward over-shrinkageFigure 22 shows the relationships among the random ef-fects As expected estimates of participant effects arenegatively correlated reflecting a mirror effect Estimatesof item effects are positively correlated reflecting a

ConventionalHierarchical

True Sensitivity Conventional Estimates

Iteration0 5 10 20 30

Aut

ocor

rela

tion

Est

imat

es

Hie

rarc

hica

l Est

imat

es

Lag

0 10 20 30

Sen

sitiv

ity

0 400 800

0 10 20 30

20

10

0

30

20

10

0

30

20

10

0

8

4

0

Figure 21 Analysis of the second simulation of a signal detection par-adigm with participant and item variability The top row shows little auto-correlation The bottom row shows conventional and hierarchical esti-mates of participantsrsquo sensitivity as functions of the true sensitivity (left)and hierarchical estimates as a function of the conventional estimates(right) True participant random effects were generated with a mirroreffect and true item random effects were generated with a shift in base-line familiarity

ndash4 ndash2 0 2 ndash1 0 1 2Effect on False Alarms Effect on False Alarms

Effe

ct o

n H

its

Effe

ct o

n H

its

Participants Items

1

0

ndash1

ndash2

2

1

0

ndash1

ndash2

Figure 22 Relationship between estimated random effects when trueparticipant effects were generated with a mirror effect and true item ef-fects were generated with a shift in baseline familiarity (Left) Estimatedparticipant effects on hit rate as a function of their effects on false alarmrate reveal a negative correlation (Right) Estimated item effects reveala positive correlation

HIERARCHICAL BAYESIAN MODELS 597

baseline-familiarity effect In sum the hierarchical modeloffers detailed and relatively accurate analysis of partici-pant and item effects in signal detection

FUTURE DEVELOPMENT OF A SIGNALDETECTION MODEL

Expanded PriorsThe simulations reveal that there is some over-shrinkage

in the hierarchical model (see Figures 19 and 21) Wespeculate that the reason this comes about is that theprior assumes independence between αα and ββ and be-tween γγ and δδ This independence may be violated bynegative correlation from mirror effects or by positivecorrelation from criterion effects or baseline-familiarityeffects The prior is neutral in that it does not favor eitherof these effects but it is informative in that it is not con-cordant with either A less informative alternative wouldnot assume independence We seek a prior in which mir-ror effects and baseline-familiarity effects as well as alack thereof are all equally likely

A prior that allows for correlations is given as follows

and

The covariance matrices ΣΣ(i) and ΣΣ( j) allow for arbi-trary correlations of participant and item effects respec-tively on hit and false alarm rates The prior on the co-variance matrix elements must be specified A suitablechoice is that the prior on the off-diagonal elements bediffuse and thus able to account for any degree and di-rection of correlation One possible prior for covariancematrices is the Wishart distribution (Wishart 1928)which is a conjugate form Lu (2004) developed a modelof correlated participant and item effects in a related ap-plication Jacobyrsquos (1991) process dissociation modelHe provided conditional posteriors as well as an MCMCimplementation This approach looks promising but sig-nificant development and analysis in this context are stillneeded

Fortunately the present independent-prior model is suf-ficient for asymptotically consistent analysis The advan-tage of the correlated prior discussed above would betwofold First there may be some increase in the accuracyof sensitivity estimates The effect would emerge for ex-treme participants for whom there is a tendency in the independent-prior model to over-shrink estimates Thesecond benefit may be increased ability to assess mirrorand baseline-familiarity effects The present independent-prior model does not give sufficient credence to either ofthese effects hence its estimates of true correlations arebiased toward 0 A correlated prior would lead to more

power in detecting mirror and baseline-familiarity effectsshould they be present

Unequal Variance Signal Detection ModelIn the present model the variances of old- and new-

item familiarity distributions are assumed to be equalThis assumption is fairly standard in many studies Re-searchers who use signal detection in memory do so be-cause it is a quick and principled psychometric model tomeasure sensitivity without a detailed commitment tothe underlying processing Our models are presentedwith this spirit they allow researchers to assess sensitiv-ity without undue influence from item variability

There is some evidence that the equal-variance as-sumption is not appropriate The experimental approachfor testing the equal-variance assumption is to explicitlymanipulate criteria and to trace out the receiver-operatingcharacteristic (Macmillan amp Creelman 1991) RatcliffSheu and Gronlund (1992) followed this approach in arecognition memory experiment and found that the stan-dard deviation of familiarity from the old-item distribu-tion was 8 that of the new-item distribution This resulttaken at face value provides motivation for future devel-opment It seems feasible to construct a model in whichthe variance of the old-item familiarity distribution is afree parameter This may be accomplished by adding avariance parameter to the probit transform of hit rateWithin the context of such a model then the variancemay be simultaneously estimated along with overall sen-sitivity participant effects and item effects

GENERAL DISCUSSION

This article presents an introduction to Bayesian hier-archical models The main goals of this article have beento (1) highlight the dangers of aggregating data in nonlin-ear contexts (2) demonstrate that Bayesian hierarchicalmodels are a feasible approach to modeling participantand item variability simultaneously and (3) implement aBayesian hierarchical signal detection model for recogni-tion memory paradigms Our concluding discussion fo-cuses on a few select issues in Bayesian modeling sub-jectivity of prior distributions model selection pitfallsin Bayesian analysis and the contribution of Bayesiantechniques to the debate about the usefulness of signifi-cance tests

Subjectivity of Prior DistributionsBayesian analysis depends on prior distributions so it

is prudent to wonder whether these distributions add toomuch subjectivity into analysis We believe that priorscan be chosen that enhance analysis without being overlysubjective In many situations it is possible to choosepriors that result in posteriors that exactly match fre-quentist sampling distributions For example Bayesianestimation of the probability of success in a binomialdistribution is the proportion of successes if the prior isa beta with parameters a b 0 A second example is


A second example is estimation of a mean from the normal distribution: The posterior of the population mean is t distributed with the appropriate degrees of freedom when the prior on the mean is flat and the prior on variance is 1/σ².
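This equivalence is easy to verify by simulation; the sketch below (ours, with arbitrary illustrative data) compares the posterior interval under the flat-mean, 1/σ² prior with the classical t interval:

# Monte Carlo check: with p(mu) flat and p(sigma^2) proportional to 1/sigma^2,
# the posterior of mu reproduces the classical t interval.
set.seed(2)
x = rnorm(12, mean = 5, sd = 2)                        # arbitrary illustrative data
n = length(x)
sigma2 = (n - 1) * var(x) / rchisq(5e4, df = n - 1)    # draws of sigma^2 | x
mu = rnorm(5e4, mean(x), sqrt(sigma2 / n))             # draws of mu | sigma^2, x
quantile(mu, c(.025, .975))                            # Bayesian 95% interval
mean(x) + qt(c(.025, .975), df = n - 1) * sd(x) / sqrt(n)  # classical t interval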

Although researchers may choose noninformative priors, it may be prudent in some situations to choose vaguely informative priors. There are two general reasons to consider an informative prior: First, vaguely informative priors are often more convenient than noninformative ones, and second, they may enhance estimation in those cases in which preexperimental knowledge is well established. We will examine these reasons in turn.

There are two ways in which vaguely informative priors are more convenient than noninformative ones. First, the use of vaguely informative conjugate priors greatly facilitates the derivation of posterior conditional distributions. This, in turn, allows for rapid sampling of conditionals in the Gibbs sampler in estimating marginal posterior distributions. Second, the vaguely informative priors here were proper, and this propriety guaranteed the propriety of the posteriors. The disadvantage of informative priors is the possibility that the choice of the prior will unduly influence the posterior. In the applications we presented, this did not occur for the chosen parameters of the prior. Increasing the variance of the priors above the chosen values has minuscule effects on estimates for reasonably sized data sets.

Although we highlight the convenience of vaguely informative proper priors, researchers may also choose noninformative priors. Noninformative priors, however, are often improper. The consequences of this choice are that (1) the researcher must prove the propriety of the posteriors and (2) MCMC may need to be implemented with the somewhat more complicated Metropolis–Hastings algorithm rather than with Gibbs sampling. Neither of these activities is prohibitive, but they do require additional skill and effort. For the models presented here, vaguely informative conjugate priors were sufficient for valid analysis.

The second reason to use informative priors is that in some applications researchers have a reasonable expectation of the range of data. In the signal detection example, we placed a normal prior on d′ that was centered at 0 and had a very wide variance. More statistical power could have been obtained by using a prior with a majority of its mass above 0 and below 6, but this increase in power would have been slight. In estimating a response probability in a two-choice psychophysics experiment, it is advantageous to place a prior with more mass above .5 than below it. The hierarchical priors advanced in this article can be viewed as a form of prior information. This is information that people are not arbitrarily dissimilar; in other words, true parameters come from orderly parent distributions (an alternative view, however, is presented by Lee & Webb, 2005). Judicious use of informative priors can enhance estimation.
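A prior of this sort can be checked directly (a sketch; Beta(3, 1.5) is an arbitrary choice of a prior with most of its mass above .5):

# Prior mass above .5 for a Beta(3, 1.5) prior on a response probability.
1 - pbeta(0.5, 3, 1.5)   # roughly .78 of the prior mass lies above .5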

We argue that the subjectivity in choosing priors is no greater than other elements of subjectivity in research. On the contrary, it may even be less so. Researchers routinely make seemingly arbitrary choices during the course of design and analysis. For example, researchers using response times typically discard those outside an arbitrary window as being too extreme to reflect the process of interest. Participants themselves may be discarded for excessive errors or for not following directions. The determination of a response-time window, an error-prone participant, or a participant not following directions imposes a fair degree of subjectivity on analysis. Within this context, the subjectivity of Bayesian priors seems benign.

Bayesian analysis may be less subjective because it can limit the need for other subjective practices. Priors, in effect, serve as a filter for cleaning data. In the present example, hierarchical priors mitigate the need to censor extremely poor-performing participants. The estimates from these participants are adjusted upward because of the influence of the prior (see Figure 15). Likewise, in our hierarchical response time models (Rouder et al., 2005; Rouder et al., 2003), there is no need to truncate long response times. The impact of these extreme observations is mediated by the prior. Hierarchical priors may be less arbitrary than other selection practices because the researcher does not have to specify criterial values for selection.

Although the use of priors may be advantageous, it does not come without risks, the main one being that the choice of a prior could unduly influence the outcome. The straightforward method of assessing this risk is to perform analyses repeatedly with a few different priors (a minimal sketch follows this paragraph). The posterior will always be affected somewhat by the choice of prior. If this effect is marginal, then the resulting inference can be accepted with greater confidence. This need for assessing the effects of the prior on the posterior is similar in spirit to the safeguards required in frequentist analysis. For example, when truncating data, the researcher is obligated to explore the effects of different points of truncation in order to ensure that this fairly arbitrary decision only marginally affects the results.
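A minimal version of such a sensitivity check, using the binomial example from above (a sketch; the alternative prior settings are assumptions):

# Re-run the binomial posterior mean under a few different Beta priors;
# a small spread across results suggests the prior is not driving the inference.
y = 34; n = 50
priors = list(c(.001, .001), c(1, 1), c(5, 5))   # assumed alternative priors
sapply(priors, function(ab) (y + ab[1]) / (n + ab[1] + ab[2]))
# .680, .673, .650: the estimate moves only slightly across priors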

Model Selection

We have not covered model testing or model selection in this article. One of the negative consequences of using hierarchical models is that individual parameter estimates are no longer independent. The value of an estimate for any participant depends, in part, on the values of others. Accordingly, it is invalid to analyze these parameters with ANOVA or ordinary least-squares regression to test hypotheses about groups or experimental manipulations.

One proper method of model selection (which includes hypothesis testing) is to compute and compare Bayes factors. This approach has been discussed extensively in the statistical literature (see Kass & Raftery, 1995; Meng & Wong, 1996) and has been imported to the psychological literature (see, e.g., Pitt, Myung, & Zhang, 2003). When using a Bayes factor approach, one computes the odds of one hypothesis being true relative to another hypothesis. Unlike estimation, computing Bayes factors is complicated, and a discussion is beyond the scope of this article. We anticipate that Bayes factors will play an increasing role in inference with nonlinear models.
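For a flavor of the computation in the simplest case, consider a binomial point null (our sketch, not the article's method; under a uniform Beta(1, 1) alternative, the marginal likelihood of y successes in n trials is 1/(n + 1)):

# Bayes factor for H0: p = .5 against a uniform (Beta(1,1)) alternative.
y = 34; n = 50
bf01 = dbinom(y, n, 0.5) / (1 / (n + 1))   # marginal likelihood under H1 is 1/(n+1)
bf01   # about 0.22; values below 1 favor the alternative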

Bayesian Pitfalls

There are a few special pitfalls of Bayesian analysis that deserve mention. First, there is a great temptation to use improper priors wherever possible. The main problem associated with improper priors is that, without analytic proof, there is no guarantee that the resulting posteriors will be proper. To complicate the situation, MCMC sampling can be done, unwittingly, from improper posterior distributions, even though the analysis is meaningless. Researchers choosing improper priors should provide evidence that the resulting posterior is proper. The propriety of the posterior is not an issue if researchers choose proper priors, but the use of proper priors is not foolproof either. In some situations, the variance of the posterior is directly related to the variance of the prior, without constraint from the data (Hobert & Casella, 1996). This unacceptable situation is an example of undue influence of the prior. When this happens, the model under consideration is most likely not appropriate for analysis. Another pitfall is that in many hierarchical situations, MCMC integration converges slowly (see Figure 18). In these cases, researchers who do not check for convergence may fail to estimate true posterior distributions. In sum, although Bayesian approaches are conceptually simple and are easily implemented in high-level packages, researchers must approach both the issues of choosing a prior and checking the ensuing analysis with thoughtfulness and care.
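Simple convergence diagnostics of the kind alluded to above take only a few lines of R (a sketch; the AR(1) series is a stand-in for a slowly mixing chain):

# Two quick convergence checks for an MCMC chain: a trace plot and the
# autocorrelation function (cf. Figure 18); the chain here is simulated.
set.seed(3)
chain = as.numeric(arima.sim(model = list(ar = 0.9), n = 1000))
plot(chain, type = "l")   # trace plot: look for drift or trends
acf(chain)                # slowly decaying autocorrelation signals slow mixing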

Bayesian Analysis and Significance Tests

Over the last 40 years, researchers have offered various critiques of null-hypothesis significance testing (see, e.g., Cohen, 1994; Cumming & Finch, 2001; Hunter, 1997; Rozeboom, 1960; Smithson, 2001; Steiger & Fouladi, 1997). Some authors offer Bayesian inference as an alternative on the basis of its philosophical underpinnings (Lee & Wagenmakers, 2005; Pruzek, 1997; Rindskopf, 1997). Our rationale for advocating Bayesian analysis is different: We adopt the Bayesian approach out of practicality. Simply put, we know how to analyze these nonlinear hierarchical models in the Bayesian framework alone. Should there be tremendous advances in frequentist nonlinear hierarchical models that make their implementation more practical than Bayesian ones, we would gladly consider these advances.

Unlike those engaged in the significance test debate, we view both Bayesian and classical methods as having sufficient theoretical grounding. These methods are tools whose applicability depends on the situation. In many experimental situations, researchers interested in the mean difference between conditions or groups can safely draw inferences using classical tools, such as t tests, ANOVA, and ordinary least-squares regression. Those worrying about robustness, power, and control of Type I error can increase the validity of analysis by replication. In many simple cases, Bayesian analysis is numerically equivalent to classical analyses. We are not advocating that researchers discard useful and convenient tools, such as null-hypothesis significance tests within ANOVA and regression models; we are advocating that they consider modeling multiple sources of variability in analysis, especially in nonlinear contexts. In sum, the usefulness of the Bayesian approach is its tractability in these more complicated settings.

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking, and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

$$x_j = \begin{cases} 1, & w_j \ge 0, \\ 0, & w_j < 0. \end{cases} \qquad \text{(A1)}$$

Latent variables w are never known. All that is known is their sign: If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ⁻¹(p) (see Note A1). Therefore,

$$w_j \mid z \overset{\mathrm{ind}}{\sim} \text{Normal}(z, 1).$$

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, …, w_J); the data are (x_1, …, x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ₀ and σ₀², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

$$\mu' = \left(N + \frac{1}{\sigma_0^2}\right)^{-1} \left(N\bar{w} + \frac{\mu_0}{\sigma_0^2}\right) \qquad \text{(A2)}$$

and

$$\sigma'^2 = \left(N + \frac{1}{\sigma_0^2}\right)^{-1}. \qquad \text{(A3)}$$

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then, by Equation A1, w_j must be negative. Hence, the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.
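As a preview of the inverse-CDF approach used there, a Normal(b, 1) truncated below at zero can be sampled by mapping uniforms through the normal quantile function (a sketch mirroring the rtnpos function of Appendix C; the function name here is our own):

# Sample n draws from a Normal(b, 1) truncated below at zero: map uniforms
# into the interval (1 - pnorm(b), 1) before applying qnorm.
rtn_below0 = function(n, b) b + qnorm(1 - runif(n) * pnorm(b))
w = rtn_below0(1000, b = -0.5)
min(w)   # all draws are positive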



NOTES

1. The beta function is given by

$$\text{Be}(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)},$$

where Γ is the generalized factorial function; for example, Γ(n) = (n − 1)!.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.



APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

$$x_{ij} = \begin{cases} 1, & w_{ij} \ge 0, \\ 0, & w_{ij} < 0. \end{cases} \qquad \text{(B1)}$$

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ₀ and σ² for σ₀²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.


APPENDIX A (Continued)

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

$$\Pr(w_j < 0) = \Pr(x_j = 0) = 1 - p.$$

Random variable w_j is a normal with variance of 1.0. The term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

$$\Pr(w_j < 0) = \Phi(0 - \mu_w) = \Phi(-\mu_w).$$

Combining this equality with the one above yields

$$\Phi(-\mu_w) = 1 - p,$$

and, by symmetry of the normal,

$$\mu_w = \Phi^{-1}(p) = z.$$


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# ---------------------------------------------
# Functions to sample from a truncated normal
# ---------------------------------------------
# generates n samples of a Normal(b, 1) truncated below at zero
rtnpos = function(n, b) {
  u = runif(n)
  b + qnorm(1 - u * pnorm(b))
}
# generates n samples of a Normal(b, 1) truncated above at zero
rtnneg = function(n, b) {
  u = runif(n)
  b + qnorm(u * pnorm(-b))
}

# -----------------------------
# Generate true values and data
# -----------------------------
I = 20   # number of participants
J = 50   # number of trials per participant

# generate each individual's true p
p = rbeta(I, 3.5, 1.5)
# result (use scan() to enter):
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251

# generate each individual's observed numbers of successes
y = rbinom(I, J, p)
# result (use scan() to enter):
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# Analysis setup
# ---------------
M = 10000          # total number of MCMC iterations
myrange = 101:M    # portion kept; burn-in of 100 iterations discarded

# needed vectors and matrices
z = matrix(ncol = I, nrow = M)   # each person's z at each iteration;
                                 # note: z is the probit of p, i.e., z = qnorm(p)
sigma2 = 1:M
mu = 1:M

# prior parameters
sigma0 = 100   # this parameter is a variance, sigma_0^2
mu0 = 0
a = 1
b = 1

# initial values
sigma2[1] = 1
mu[1] = 1
z[1, ] = rep(1, I)

# shape for inverse gamma, calculated outside of the loop for speed
shape = I / 2 + a

# ---------------
# Main MCMC loop
# ---------------
for (m in 2:M) {                 # iteration
  for (i in 1:I) {               # participant
    # samples of w | y, z
    w = c(rtnpos(y[i], z[m - 1, i]), rtnneg(J - y[i], z[m - 1, i]))
    prec = J + 1 / sigma2[m - 1]
    meanz = (sum(w) + mu[m - 1] / sigma2[m - 1]) / prec
    z[m, i] = rnorm(1, meanz, sqrt(1 / prec))   # sample of z | w, mu, sigma2
  }
  prec = I / sigma2[m - 1] + 1 / sigma0
  meanmu = (sum(z[m, ]) / sigma2[m - 1] + mu0 / sigma0) / prec
  mu[m] = rnorm(1, meanmu, sqrt(1 / prec))      # sample of mu | z
  rate = sum((z[m, ] - mu[m])^2) / 2 + b
  sigma2[m] = 1 / rgamma(1, shape = shape, scale = 1 / rate)  # sample of sigma2 | z
}

# Gather estimates of p for each individual
allpest = pnorm(z[myrange, ])     # MCMC chains of p
pesth = apply(allpest, 2, mean)   # posterior means
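After the loop, a quick plot of the posterior means against the observed proportions shows the shrinkage discussed in the text (our addition, not part of the article's code; it assumes the objects y, J, and pesth from above):

# Hierarchical estimates vs. observed proportions: extreme participants
# are pulled toward the group mean by the hierarchical prior.
plot(y / J, pesth, xlab = "Observed proportion", ylab = "Posterior mean")
abline(0, 1)   # departures from this line reflect shrinkage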

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)

Page 23: Rouder & Lu (2005)

HIERARCHICAL BAYESIAN MODELS 595

provement on the conventional practice of aggregatinghits and false alarms across items

The hierarchical model can also be used to explore thenature of participant and item effects The data were gen-erated by mirror effect principles Increases in hit ratescorresponded to decreases in false alarm rates Althoughthis principle was not assumed in analysis the resultingestimates of participant and item effects should show amirror effect The left panel in Figure 20 shows the rela-tionship between participant effects Each point is from

a different participant The y-axis value of the point isαi the participantrsquos hit rate effect the x-axis value is βi the participantrsquos false alarm rate effect The negative cor-relation is expected Participants with higher hit rateshave lower false alarm rates The right panel presents thesame effects for items (γj vs δj) As was expected itemswith higher hit rates have lower false alarm rates

To demonstrate the modelrsquos ability to disentangle dif-ferent types of item and participant effects we performeda second simulation In this case participant effects re-

ConventionalHierarchical

True Sensitivity Conventional Estimates

Iteration0 5 10 20 30

Aut

ocor

rela

tion

Est

imat

es

Hie

rarc

hica

l Est

imat

es

Lag

0 10 20 30

Sen

sitiv

ity

0 400 800

0 10 20 30

20

15

10

05

0

30

20

10

0

30

20

10

0

8

4

0

Figure 19 Analysis of a signal detection paradigm with participantand item variability The top row shows that blocked sampling mitigatesautocorrelation The bottom row shows conventional and hierarchicalestimates of participantsrsquo sensitivity as functions of the true sensitivity(left) and hierarchical estimates as a function of the conventional esti-mates (right) True random effects were generated with a mirror effect

ndash6 ndash2 2

ndash

ndash15 ndash05 05 15Effect on False Alarms Effect on False Alarms

Effe

ct o

n H

its

Effe

ct o

n H

its

Participants Items

4

2

0

ndash2

10

0

ndash10

ndash20

Figure 20 Relationship between estimated random effects when true valueswere generated with a mirror effect (Left) Estimated participant effects on hitrate as a function of their effects on false alarm rate reveal a negative correla-tion (Right) Estimated item effects also reveal a negative correlation

596 ROUDER AND LU

flected the mirror effect as before Item effects howeverreflected baseline familiarity Some items were assumedto be familiar which resulted in increased hits and falsealarms Others lacked familiarity which resulted in de-creased hits and false alarms This baseline-familiarityeffect was implemented by setting the effect of an itemon hits equal to that on false alarms (γj δj) Overall sen-sitivity results of the simulation are shown in Figure 21

These are highly similar to the previous simulation in that(1) there is little autocorrelation (2) the hierarchical esti-mates are much better than their conventional counter-parts and (3) there is a tendency toward over-shrinkageFigure 22 shows the relationships among the random ef-fects As expected estimates of participant effects arenegatively correlated reflecting a mirror effect Estimatesof item effects are positively correlated reflecting a

ConventionalHierarchical

True Sensitivity Conventional Estimates

Iteration0 5 10 20 30

Aut

ocor

rela

tion

Est

imat

es

Hie

rarc

hica

l Est

imat

es

Lag

0 10 20 30

Sen

sitiv

ity

0 400 800

0 10 20 30

20

10

0

30

20

10

0

30

20

10

0

8

4

0

Figure 21 Analysis of the second simulation of a signal detection par-adigm with participant and item variability The top row shows little auto-correlation The bottom row shows conventional and hierarchical esti-mates of participantsrsquo sensitivity as functions of the true sensitivity (left)and hierarchical estimates as a function of the conventional estimates(right) True participant random effects were generated with a mirroreffect and true item random effects were generated with a shift in base-line familiarity

ndash4 ndash2 0 2 ndash1 0 1 2Effect on False Alarms Effect on False Alarms

Effe

ct o

n H

its

Effe

ct o

n H

its

Participants Items

1

0

ndash1

ndash2

2

1

0

ndash1

ndash2

Figure 22 Relationship between estimated random effects when trueparticipant effects were generated with a mirror effect and true item ef-fects were generated with a shift in baseline familiarity (Left) Estimatedparticipant effects on hit rate as a function of their effects on false alarmrate reveal a negative correlation (Right) Estimated item effects reveala positive correlation

HIERARCHICAL BAYESIAN MODELS 597

baseline-familiarity effect In sum the hierarchical modeloffers detailed and relatively accurate analysis of partici-pant and item effects in signal detection

FUTURE DEVELOPMENT OF A SIGNALDETECTION MODEL

Expanded PriorsThe simulations reveal that there is some over-shrinkage

in the hierarchical model (see Figures 19 and 21) Wespeculate that the reason this comes about is that theprior assumes independence between αα and ββ and be-tween γγ and δδ This independence may be violated bynegative correlation from mirror effects or by positivecorrelation from criterion effects or baseline-familiarityeffects The prior is neutral in that it does not favor eitherof these effects but it is informative in that it is not con-cordant with either A less informative alternative wouldnot assume independence We seek a prior in which mir-ror effects and baseline-familiarity effects as well as alack thereof are all equally likely

A prior that allows for correlations is given as follows

and

The covariance matrices ΣΣ(i) and ΣΣ( j) allow for arbi-trary correlations of participant and item effects respec-tively on hit and false alarm rates The prior on the co-variance matrix elements must be specified A suitablechoice is that the prior on the off-diagonal elements bediffuse and thus able to account for any degree and di-rection of correlation One possible prior for covariancematrices is the Wishart distribution (Wishart 1928)which is a conjugate form Lu (2004) developed a modelof correlated participant and item effects in a related ap-plication Jacobyrsquos (1991) process dissociation modelHe provided conditional posteriors as well as an MCMCimplementation This approach looks promising but sig-nificant development and analysis in this context are stillneeded

Fortunately the present independent-prior model is suf-ficient for asymptotically consistent analysis The advan-tage of the correlated prior discussed above would betwofold First there may be some increase in the accuracyof sensitivity estimates The effect would emerge for ex-treme participants for whom there is a tendency in the independent-prior model to over-shrink estimates Thesecond benefit may be increased ability to assess mirrorand baseline-familiarity effects The present independent-prior model does not give sufficient credence to either ofthese effects hence its estimates of true correlations arebiased toward 0 A correlated prior would lead to more

power in detecting mirror and baseline-familiarity effectsshould they be present

Unequal Variance Signal Detection ModelIn the present model the variances of old- and new-

item familiarity distributions are assumed to be equalThis assumption is fairly standard in many studies Re-searchers who use signal detection in memory do so be-cause it is a quick and principled psychometric model tomeasure sensitivity without a detailed commitment tothe underlying processing Our models are presentedwith this spirit they allow researchers to assess sensitiv-ity without undue influence from item variability

There is some evidence that the equal-variance as-sumption is not appropriate The experimental approachfor testing the equal-variance assumption is to explicitlymanipulate criteria and to trace out the receiver-operatingcharacteristic (Macmillan amp Creelman 1991) RatcliffSheu and Gronlund (1992) followed this approach in arecognition memory experiment and found that the stan-dard deviation of familiarity from the old-item distribu-tion was 8 that of the new-item distribution This resulttaken at face value provides motivation for future devel-opment It seems feasible to construct a model in whichthe variance of the old-item familiarity distribution is afree parameter This may be accomplished by adding avariance parameter to the probit transform of hit rateWithin the context of such a model then the variancemay be simultaneously estimated along with overall sen-sitivity participant effects and item effects

GENERAL DISCUSSION

This article presents an introduction to Bayesian hier-archical models The main goals of this article have beento (1) highlight the dangers of aggregating data in nonlin-ear contexts (2) demonstrate that Bayesian hierarchicalmodels are a feasible approach to modeling participantand item variability simultaneously and (3) implement aBayesian hierarchical signal detection model for recogni-tion memory paradigms Our concluding discussion fo-cuses on a few select issues in Bayesian modeling sub-jectivity of prior distributions model selection pitfallsin Bayesian analysis and the contribution of Bayesiantechniques to the debate about the usefulness of signifi-cance tests

Subjectivity of Prior DistributionsBayesian analysis depends on prior distributions so it

is prudent to wonder whether these distributions add toomuch subjectivity into analysis We believe that priorscan be chosen that enhance analysis without being overlysubjective In many situations it is possible to choosepriors that result in posteriors that exactly match fre-quentist sampling distributions For example Bayesianestimation of the probability of success in a binomialdistribution is the proportion of successes if the prior isa beta with parameters a b 0 A second example is

γ

δj

j

⎝⎜

⎠⎟ ( )ind

~ Bivariate Normal 0 ( ) ΣΣ j

αβ

i

i

⎝⎜⎞

⎠⎟( )ind

~ Bivariate Normal 0 i ΣΣ( )

598 ROUDER AND LU

from the normal distribution The posterior of the popu-lation mean is t distributed with the appropriate degreesof freedom when the prior on the mean is flat and theprior on variance is 1σ 2

Although researchers may choose noninformativepriors it may be prudent in some situations to choosevaguely informative priors There are two general rea-sons to consider an informative prior First vaguely in-formative priors are often more convenient than nonin-formative ones and second they may enhance estimationin those cases in which preexperimental knowledge iswell established We will examine these reasons in turn

There are two ways in which vaguely informative pri-ors are more convenient than noninformative ones Firstthe use of vaguely informative conjugate priors greatlyfacilitates the derivation of posterior conditional distrib-utions This in turn allows for rapid sampling of condi-tionals in the Gibbs sampler in estimating marginal pos-terior distributions Second the vaguely informativepriors here were proper and this propriety guaranteedthe propriety of the posteriors The disadvantage of in-formative priors is the possibility that the choice of theprior will unduly influence the posterior In the applica-tions we presented this did not occur for the chosen pa-rameters of the prior Increasing the variance of the pri-ors above the chosen values has minuscule effects onestimates for reasonably sized data sets

Although we highlight the convenience of vaguely in-formative proper priors researchers may also choose non-imformative priors Noninformative priors however areoften improper The consequences of this choice are that(1) the researcher must prove the propriety of the posteri-ors and (2) MCMC may need to be implemented with thesomewhat more complicated MetropolisndashHastings algo-rithm rather than with Gibbs sampling Neither of these ac-tivities are prohibitive but they do require additional skilland effort For the models presented here vaguely infor-mative conjugate priors were sufficient for valid analysis

The second reason to use informative priors is that insome applications researchers have a reasonable expec-tation of the range of data In the signal detection exam-ple we placed a normal prior on d prime that was centered at0 and had a very wide variance More statistical powercould have been obtained by using a prior with a major-ity of its mass above 0 and below 6 but this increase inpower would have been slight In estimating a responseprobability in a two-choice psychophysics experimentit is advantageous to place a prior with more mass above5 than below it The hierarchical priors advanced in thisarticle can be viewed as a form of prior information Thisis information that people are not arbitrarily dissimilarmdashin other words that true parameters come from orderlyparent distributions (an alternative view however is pre-sented by Lee amp Webb 2005) Judicious use of infor-mative priors can enhance estimation

We argue that the subjectivity in choosing priors is nogreater than other elements of subjectivity in research

On the contrary it may even be less so Researchers rou-tinely make seemingly arbitrary choices during thecourse of design and analysis For example researchersusing response times typically discard those outside anarbitrary window as being too extreme to reflect the pro-cess of interest Participants themselves may be dis-carded for excessive errors or for not following direc-tions The determination of a response-time window anerror-prone participant or a participant not following di-rections imposes a fair degree of subjectivity on analy-sis Within this context the subjectivity of Bayesian pri-ors seems benign

Bayesian analysis may be less subjective because itcan limit the need for other subjective practices Priors ineffect serve as a filter for cleaning data In the presentexample hierarchical priors mitigate the need to censorextremely poor-performing participants The estimatesfrom these participants are adjusted upward because ofthe influence of the prior (see Figure 15) Likewise inour hierarchical response time models (Rouder et al2005 Rouder et al 2003) there is no need to truncatelong response times The impact of these extreme obser-vations is mediated by the prior Hierarchical priors maybe less arbitrary than other selection practices becausethe researcher does not have to specify criterial valuesfor selection

Although the use of priors may be advantageous itdoes not come without risksmdashthe main one being thatthe choice of a prior could unduly influence the outcomeThe straightforward method of assessing this risk is toperform analyses repeatedly with a few different priorsThe posterior will always be affected somewhat by thechoice of prior If this effect is marginal then the result-ing inference can be accepted with greater confidenceThis need for assessing the effects of the prior on theposterior is similar in spirit to the safeguards required infrequentist analysis For example when truncating datathe researcher is obligated to explore the effects of dif-ferent points of truncation in order to ensure that thisfairly arbitrary decision only marginally affects the results

Model SelectionWe have not covered model testing or model selection

in this article One of the negative consequences of usinghierarchical models is that individual parameter esti-mates are no longer independent The value of an esti-mate for any participant depends in part on the valuesof others Accordingly it is invalid to analyze these pa-rameters with ANOVA or ordinary least-squares regres-sion to test hypotheses about groups or experimental manipulations

One proper method of model selection (which includeshypothesis testing) is to compute and compare Bayesfactors This approach has been discussed extensively inthe statistical literature (see Kass amp Raftery 1995 Mengamp Wong 1996) and has been imported to the psycho-

HIERARCHICAL BAYESIAN MODELS 599

logical literature (see eg Pitt Myung amp Zhang 2003)When using a Bayes factor approach one computes theodds of one hypothesis being true relative to another hy-pothesis Unlike estimation computing Bayes factors iscomplicated and a discussion is beyond the scope of thisarticle We anticipate that Bayes factors will play an in-creasing role in inference with nonlinear models

Bayesian PitfallsThere are a few special pitfalls of Bayesian analysis

that deserve mention First there is a great temptation touse improper priors wherever possible The main prob-lem associated with improper priors is that without ana-lytic proof there is no guarantee that the resulting pos-teriors will be proper To complicate the situationMCMC sampling can be done unwittingly from im-proper posterior distributions even though the analysis ismeaningless Researchers choosing improper priorsshould provide evidence that the resulting posterior isproper The propriety of the posterior is not an issue if re-searchers choose proper priors but the use of proper pri-ors is not foolproof either In some situations the vari-ance of the posterior is directly related to the variance ofthe prior without constraint from the data (Hobert ampCasella 1996) This unacceptable situation is an exampleof undue influence of the prior When this happens themodel under consideration is most likely not appropriatefor analysis Another pitfall is that in many hierarchicalsituations MCMC integration converges slowly (seeFigure 18) In these cases researchers who do not checkfor convergence may fail to estimate true posterior dis-tributions In sum although Bayesian approaches areconceptually simple and are easily implemented in high-level packages researchers must approach both the is-sues of choosing a prior and checking the ensuing analy-sis with thoughtfulness and care

Bayesian Analysis and Significance TestsOver the last 40 years researchers have offered vari-

ous critiques of null-hypothesis significance testing (seeeg Cohen 1994 Cumming amp Finch 2001 Hunter1997 Rozeboom 1960 Smithson 2001 Steiger ampFouladi 1997) Some authors offer Bayesian inferenceas an alternative on the basis of its philosophical under-pinnings (Lee amp Wagenmakers 2005 Pruzek 1997Rindskopf 1997) Our rationale for advocating Bayesiananalysis is different We adopt the Bayesian approach outof practicality Simply put we know how to analyzethese nonlinear hierarchical models in the Bayesianframework alone Should there be tremendous advancesin frequentist nonlinear hierarchical models that maketheir implementation more practical than Bayesian oneswe would gladly consider these advances

Unlike those engaged in the significance test debatewe view both Bayesian and classical methods as havingsufficient theoretical grounding These methods are toolswhose applicability depends on the situation In many

experimental situations researchers interested in themean difference between conditions or groups can safelydraw inferences using classical tools such as t testsANOVA and ordinary least-squares regression Thoseworrying about robustness power and control of Type Ierror can increase the validity of analysis by replicationIn many simple cases Bayesian analysis is numericallyequivalent to classical analyses We are not advocatingthat researchers discard useful and convenient tools suchas null-hypothesis significance tests within ANOVA andregression models we are advocating that they considermodeling multiple sources of variability in analysis es-pecially in nonlinear contexts In sum the usefulness ofthe Bayesian approach is its tractability in these morecomplicated settings

REFERENCES

Ahrens J H amp Dieter U (1974) Computer methods for samplingfrom gamma beta Poisson and binomial distributions Computing12 223-246

Ahrens J H amp Dieter U (1982) Generating gamma variates by amodified rejection technique Communications of the Association forComputing Machinery 25 47-54

Albert J amp Chib S (1995) Bayesian residual analysis for binary re-sponse regression models Biometrika 82 747-759

Ashby F G (1992) Multivariate probability distributions In F GAshby (Ed) Multidimensional models of perception and cognition(pp 1-34) Hillsdale NJ Erlbaum

Ashby F G Maddox W T amp Lee W W (1994) On the dangers ofaveraging across subjects when using multidimensional scaling or thesimilarity-choice model Psychological Science 5 144-151

Baayen R H Tweedie F J amp Schreuder R (2002) The subjectsas a simple random effect fallacy Subject variability and morpho-logical family effects in the mental lexicon Brain amp Language 8155-65

Bayes T (1763) An essay towards solving a problem in the doctrineof chances Philosophical Transactions of the Royal Society of Lon-don 53 370-418

Clark H H (1973) The language-as-fixed-effect fallacy A critiqueof language statistics in psychological research Journal of VerbalLearning amp Verbal Behavior 12 335-359

Cohen J (1994) The earth is round ( p 05) American Psychologist49 997-1003

Cumming G amp Finch S (2001) A primer on the understanding useand calculation of confidence intervals that are based on central andnoncentral distributions Educational amp Psychological Measurement61 532-574

Curran T C amp Hintzman D L (1995) Violations of the indepen-dence assumption in process dissociation Journal of ExperimentalPsychology Learning Memory amp Cognition 21 531-547

Egan J P (1975) Signal detection theory and ROC analysis NewYork Academic Press

Einstein G O McDaniel M A amp Lackey S (1989) Bizarre im-agery interference and distinctiveness Journal of Experimental Psy-chology Learning Memory amp Cognition 15 137-146

Estes W K (1956) The problem of inference from curves based ongrouped data Psychological Bulletin 53 134-140

Forster K I amp Dickinson R G (1976) More on the language-as-fixed-effect fallacy Monte Carlo estimates of error rates for F1 F2F prime and min F prime Journal of Verbal Learning amp Verbal Behavior 15135-142

Gelfand A E amp Smith A F M (1990) Sampling-based approachesto calculating marginal densities Journal of the American StatisticalAssociation 85 398-409

600 ROUDER AND LU

Gelman A Carlin J B Stern H S amp Rubin D B (2004)Bayesian data analysis (2nd ed) London Chapman amp Hall

Gelman A amp Rubin D B (1992) Inference from iterative simula-tion using multiple sequences (with discussion) Statistical Science7 457-511

Geman S amp Geman D (1984) Stochastic relaxation Gibbs distri-bution and the Bayesian restoration of images IEEE Transactionson Pattern Analysis amp Machine Intelligence 6 721-741

Geweke J (1992) Evaluating the accuracy of sampling-based ap-proaches to the calculation of posterior moments In J M BernardoJ O Berger A P Dawid amp A F M Smith (Eds) Bayesian statis-tics (Vol 4 pp 169-194) Oxford Oxford University Press Claren-don Press

Gilks W R Richardson S E amp Spiegelhalter D J (1996)Markov chain Monte Carlo in practice London Chapman amp Hall

Gill J (2002) Bayesian methods A social and behavioral sciences ap-proach London Chapman amp Hall

Gilmore G C Hersh H Caramazza A amp Griffin J (1979)Multidimensional letter similarity derived from recognition errorsPerception amp Psychophysics 25 425-431

Glanzer M Adams J K Iverson G J amp Kim K (1993) The reg-ularities of recognition memory Psychological Review 100 546-567

Green D M amp Swets J A (1966) Signal detection theory andpsychophysics New York Wiley

Greenwald A G Draine S C amp Abrams R L (1996) Three cog-nitive markers of unconscious semantic activation Science 2731699-1702

Haider H amp Frensch P A (2002) Why aggregated learning followsthe power law of practice when individual learning does not Com-ment on Rickard (1997 1999) Delaney et al (1998) and Palmeri(1999) Journal of Experimental Psychology Learning Memory ampCognition 28 392-406

Heathcote A Brown S amp Mewhort D J K (2000) The powerlaw repealed The case for an exponential law of practice Psycho-nomic Bulletin amp Review 7 185-207

Hirshman E Whelley M M amp Palij M (1989) An investigationof paradoxical memory effects Journal of Memory amp Language 28594-609

Hobert J P amp Casella G (1996) The effect of improper priors onGibbs sampling in hierarchical linear mixed models Journal of theAmerican Statistical Association 91 1461-1473

Hohle R H (1965) Inferred components of reaction time as a func-tion of foreperiod duration Journal of Experimental Psychology 69382-386

Hunter J E (1997) Needed A ban on the significance test Psycho-logical Science 8 3-7

Jacoby L L (1991) A process dissociation framework Separating au-tomatic from intentional uses of memory Journal of Memory amp Lan-guage 30 513-541

Jeffreys H (1961) Theory of probability (3rd ed) New York OxfordUniversity Press

Kass R E amp Raftery A E (1995) Bayes factors Journal of theAmerican Statistical Association 90 773-795

Kreft I amp de Leeuw J (1998) Introducing multilevel modelingLondon Sage

Lee M D amp Wagenmakers E J (2005) Bayesian statistical infer-ence in psychology Comment on Trafimow (2003) PsychologicalReview 112 662-668

Lee M D amp Webb M R (2005) Modeling individual differences incognition Manuscript submitted for publication

Lu J (2004) Bayesian hierarchical models for process dissociationframework in memory research Unpublished manuscript

Luce R D (1963) Detection and recognition In R D Luce R RBush amp E Galanter (Eds) Handbook of mathematical psychology(Vol 1 pp 103-189) New York Wiley

Macmillan N A amp Creelman C D (1991) Detection theory Auserrsquos guide Cambridge Cambridge University Press

Massaro D W amp Oden G C (1979) Integration of featural infor-mation in speech perception Psychological Review 85 172-191

McClelland J L amp Rumelhart D E (1981) An interactive acti-vation model of context effects in letter perception I An account ofbasic findings Psychological Review 88 375-407

Medin D L amp Schaffer M M (1978) Context theory of classifi-cation learning Psychological Review 85 207-238

Meng X amp Wong W H (1996) Simulating ratios of normalizingconstants via a simple identity A theoretical exploration StatisticaSinica 6 831-860

Nosofsky R M (1986) Attention similarity and the identificationndashcategorization relationship Journal of Experimental PsychologyGeneral 115 39-57

Pitt M A Myung I-J amp Zhang S (2003) Toward a method of se-lecting among computational models of cognition Psychological Re-view 109 472-491

Pra Baldi A de Beni R Cornoldi C amp Cavedon A (1985)Some conditions of the occurrence of the bizarreness effect in recallBritish Journal of Psychology 76 427-436

Pruzek R M (1997) An introduction to Bayesian inference and itsapplications In L Harlow S Mulaik amp J Steiger (Eds) What if there were no significance tests (pp 221-257) Mahwah NJ Erlbaum

Raaijmakers J G W Schrijnemakers J M C amp Gremmen F(1999) How to deal with ldquothe language-as-fixed-effect fallacyrdquoCommon misconceptions and alternative solutions Journal of Mem-ory amp Language 41 416-426

Raftery A E amp Lewis S M (1992) One long run with diagnosticsImplementation strategies for Markov chain Monte Carlo StatisticalScience 7 493-497

Ratcliff R (1978) A theory of memory retrieval Psychological Re-view 85 59-108

Ratcliff R amp Rouder J N (1998) Modeling response times for de-cisions between two choices Psychological Science 9 347-356

Ratcliff R Sheu C-F amp Gronlund S D (1992) Testing globalmemory models using ROC curves Psychological Review 99 518-535

Riefer D M amp Rouder J N (1992) A multinomial modeling analy-sis of the mnemonic benefits of bizarre imagery Memory amp Cogni-tion 20 601-611

Rindskopf R M (1997) Testing ldquosmallrdquo not null hypotheses Clas-sical and Bayesian approaches In L Harlow S Mulaik amp J Steiger(Eds) What if there were no significance tests (pp 221-257) Mah-wah NJ Erlbaum

Roberts G O amp Sahu S K (1997) Updating schemes correlationstructure blocking and parameterization for the Gibbs sampler Jour-nal of the Royal Statistical Society Series B 59 291-317

Rouder J N (2000) Assessing the roles of change discrimination andluminance integration Evidence for a hybrid race model of percep-tual decision making in luminance discrimination Journal of Exper-imental Psychology Human Perception amp Performance 26 359-378

Rouder J N Lu J Speckman P [L] Sun D amp Jiang Y (2005)A hierarchical model for estimating response time distributions Psy-chonomic Bulletin amp Review 12 195-223

Rouder J N Sun D Speckman P L Lu J amp Zhou D (2003) Ahierarchical Bayesian statistical framework for response time distri-butions Psychometrika 68 589-606

Rozeboom W W (1960) The fallacy of the null-hypothesis signifi-cance test Psychological Bulletin 57 416-428

Smithson M (2001) Correct confidence intervals for various regres-sion effect sizes and parameters The importance of noncentral dis-tributions in computing intervals Educational amp Psychological Mea-surement 61 605-632

Steiger J H amp Fouladi R T (1997) Noncentrality interval esti-mation and the evaluation of statistical models In L Harlow S Mu-laik amp J Steiger (Eds) What if there were no significance tests(pp 221-257) Mahwah NJ Erlbaum

Tanner M A (1996) Tools for statistical inference Methods for theexploration of posterior distributions and likelihood functions (3rded) New York Springer

Tierney L (1994) Markov chains for exploring posterior distribu-tions Annals of Statistics 22 1701-1728

APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let x_j be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable w_j is defined as follows:

x_j = \begin{cases} 1, & w_j \ge 0, \\ 0, & w_j < 0. \end{cases}   (A1)

Latent variables w are never known; all that is known is their sign: If trial j was successful, w_j was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable w_j is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where z = Φ^{-1}(p).^{A1} Therefore,

w_j \stackrel{ind}{\sim} \mathrm{Normal}(z, 1).

This construction of w_j is shown in Figure A1a. The parameters of interest are z and w = (w_1, ..., w_J); the data are (x_1, ..., x_J). Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and w_j | z, x_j, for j = 1, ..., J. We start with the former. Parameter z is the population mean of w; hence this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ_0 and σ_0^2, so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

\mu' = \left(N + 1/\sigma_0^2\right)^{-1} \left(N\bar{w} + \mu_0/\sigma_0^2\right)   (A2)

and

\sigma'^2 = \left(N + 1/\sigma_0^2\right)^{-1}.   (A3)

The conditional w_j | z, x_j is only slightly more complex. If x_j is unknown, latent variable w_j is normally distributed and centered at z. When w_j is conditioned on x_j, the sign of w_j is known. If x_j = 0, then by Equation A1, w_j must be negative; hence the conditional w_j | z, x_j is a normal density that is truncated above at 0 (Figure A1b). Likewise, if x_j = 1, then w_j must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.

Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable w_j for p = .84. (b) Conditional distribution of w_j on a failure on trial j; in this case, w_j must be negative. (c) Conditional distribution of w_j on a success on trial j; in this case, w_j must be positive.

NOTE TO APPENDIX A

A1. Let μ_w be the mean of w. Then, by construction,

\Pr(w_j < 0) = \Pr(x_j = 0) = 1 - p.

Random variable w_j is a normal with variance of 1.0, so the term Pr(w_j < 0) is its cumulative distribution function evaluated at 0. Therefore,

\Pr(w_j < 0) = \Phi(0 - \mu_w) = \Phi(-\mu_w).

Combining this equality with the one above yields

\Phi(-\mu_w) = 1 - p,

and, by symmetry of the normal,

\mu_w = \Phi^{-1}(p) = z.
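To make this two-conditional Gibbs cycle concrete, the following is a minimal R sketch for a single probability p = Φ(z). It is our illustration rather than part of the original listing; the function name gibbs.p, its default arguments, and the example data are hypothetical, and the sketch assumes the truncated-normal samplers rtnpos and rtnneg defined in Appendix C.

#illustrative sketch (ours), not from the original listing
#assumes rtnpos and rtnneg from Appendix C; x is a vector of 0s and 1s
gibbs.p=function(x,M=5000,mu0=0,sigma0=100)
{
J=length(x) #number of trials
k=sum(x) #number of successes
z=numeric(M)
z[1]=0 #starting value
for (m in 2:M)
{
w=c(rtnpos(k,z[m-1]),rtnneg(J-k,z[m-1])) #w|z,x: truncated normals
prec=J+1/sigma0 #inverse of Equation A3
z[m]=rnorm(1,(sum(w)+mu0/sigma0)/prec,sqrt(1/prec)) #z|w: Equation A2
}
pnorm(z[-(1:100)]) #drop burn-in; return chain of p
}
#example: 42 successes in 50 trials
#pchain=gibbs.p(rep(c(1,0),c(42,8))); mean(pchain)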




APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let x_ij be the indicator for the ith person on the jth trial: x_ij = 1 if the trial is successful and 0 otherwise. Latent variable w_ij is constructed as

x_{ij} = \begin{cases} 1, & w_{ij} \ge 0, \\ 0, & w_{ij} < 0. \end{cases}   (B1)

Conditional posterior w_ij | z_i, x_ij is a truncated normal, as was discussed in Appendix A. Conditional posterior z_i | w, μ, σ^2 is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ_0 and σ^2 for σ_0^2). Conditional posterior densities of μ | σ^2, z and σ^2 | μ, z are given in Equations 25 and 28, respectively.
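Written out with these substitutions (our notation; J is the number of trials per person, so the N and w̄ of Equations A2 and A3 become J and the mean of w_i1, ..., w_iJ), the conditional used for each person is

z_i \mid \mathbf{w}_i, \mu, \sigma^2 \sim \mathrm{Normal}\!\left( \left(J + 1/\sigma^2\right)^{-1} \Big( \sum_j w_{ij} + \mu/\sigma^2 \Big),\ \left(J + 1/\sigma^2\right)^{-1} \right),

which is the draw implemented for z[m,i] in the main MCMC loop of Appendix C.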



APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

#functions to sample from a truncated normal
#-------------------------------------------
#generates n samples of a normal (b,1) truncated below zero
rtnpos=function(n,b) {u=runif(n); b+qnorm(1-u*pnorm(b))}
#generates n samples of a normal (b,1) truncated above zero
rtnneg=function(n,b) {u=runif(n); b+qnorm(u*pnorm(-b))}
#-----------------------------
#generate true values and data
#-----------------------------
I=20 #Number of participants
J=50 #Number of Trials per participant
#generate each individual's true p
p=rbeta(I,3.5,1.5)
#result, use scan() to enter:
#0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
#0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
#0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
#generate each individual's observed numbers of successes
y=rbinom(I,J,p)
#result, use scan() to enter:
#34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
#----------------------------------------
#Analysis, Set Up
M=10000 #total number of MCMC iterations
myrange=101:M #portion kept, burn-in of 100 iterations discarded
#needed vectors and matrices
z=matrix(ncol=I,nrow=M) #matrix of each person's z at each iteration
#note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
#prior parameters
sigma0=100 #this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
#initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
#shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a
#-----------------------------
#Main MCMC loop
#-----------------------------
for (m in 2:M) #iteration
{
for (i in 1:I) #participant
{
w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i])) #samples of w|y,z
prec=J+1/sigma2[m-1]
meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
z[m,i]=rnorm(1,meanz,sqrt(1/prec)) #sample of z|w,mu,sigma2
}
prec=I/sigma2[m-1]+1/sigma0
meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
mu[m]=rnorm(1,meanmu,sqrt(1/prec)) #sample of mu|z
rate=sum((z[m,]-mu[m])^2)/2+b
sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate) #sample of sigma2|z
}
#-----------------------------------------
#Gather estimates of p for each individual
allpest=pnorm(z[myrange,]) #MCMC chains of p
pesth=apply(allpest,2,mean) #posterior means
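Two natural postprocessing steps, not part of the original listing but standard practice with chains like these, are interval estimation and an informal convergence check; for example:

#95% credible interval for each person's p (columns of allpest)
pci=apply(allpest,2,quantile,probs=c(.025,.975))
#eyeball convergence of the hyperparameter chains
plot(mu[myrange],type="l")
plot(sigma2[myrange],type="l")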

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)

Page 24: Rouder & Lu (2005)

596 ROUDER AND LU

flected the mirror effect as before Item effects howeverreflected baseline familiarity Some items were assumedto be familiar which resulted in increased hits and falsealarms Others lacked familiarity which resulted in de-creased hits and false alarms This baseline-familiarityeffect was implemented by setting the effect of an itemon hits equal to that on false alarms (γj δj) Overall sen-sitivity results of the simulation are shown in Figure 21

These are highly similar to the previous simulation in that(1) there is little autocorrelation (2) the hierarchical esti-mates are much better than their conventional counter-parts and (3) there is a tendency toward over-shrinkageFigure 22 shows the relationships among the random ef-fects As expected estimates of participant effects arenegatively correlated reflecting a mirror effect Estimatesof item effects are positively correlated reflecting a

ConventionalHierarchical

True Sensitivity Conventional Estimates

Iteration0 5 10 20 30

Aut

ocor

rela

tion

Est

imat

es

Hie

rarc

hica

l Est

imat

es

Lag

0 10 20 30

Sen

sitiv

ity

0 400 800

0 10 20 30

20

10

0

30

20

10

0

30

20

10

0

8

4

0

Figure 21 Analysis of the second simulation of a signal detection par-adigm with participant and item variability The top row shows little auto-correlation The bottom row shows conventional and hierarchical esti-mates of participantsrsquo sensitivity as functions of the true sensitivity (left)and hierarchical estimates as a function of the conventional estimates(right) True participant random effects were generated with a mirroreffect and true item random effects were generated with a shift in base-line familiarity

ndash4 ndash2 0 2 ndash1 0 1 2Effect on False Alarms Effect on False Alarms

Effe

ct o

n H

its

Effe

ct o

n H

its

Participants Items

1

0

ndash1

ndash2

2

1

0

ndash1

ndash2

Figure 22 Relationship between estimated random effects when trueparticipant effects were generated with a mirror effect and true item ef-fects were generated with a shift in baseline familiarity (Left) Estimatedparticipant effects on hit rate as a function of their effects on false alarmrate reveal a negative correlation (Right) Estimated item effects reveala positive correlation

HIERARCHICAL BAYESIAN MODELS 597

baseline-familiarity effect In sum the hierarchical modeloffers detailed and relatively accurate analysis of partici-pant and item effects in signal detection

FUTURE DEVELOPMENT OF A SIGNALDETECTION MODEL

Expanded PriorsThe simulations reveal that there is some over-shrinkage

in the hierarchical model (see Figures 19 and 21) Wespeculate that the reason this comes about is that theprior assumes independence between αα and ββ and be-tween γγ and δδ This independence may be violated bynegative correlation from mirror effects or by positivecorrelation from criterion effects or baseline-familiarityeffects The prior is neutral in that it does not favor eitherof these effects but it is informative in that it is not con-cordant with either A less informative alternative wouldnot assume independence We seek a prior in which mir-ror effects and baseline-familiarity effects as well as alack thereof are all equally likely

A prior that allows for correlations is given as follows

and

The covariance matrices ΣΣ(i) and ΣΣ( j) allow for arbi-trary correlations of participant and item effects respec-tively on hit and false alarm rates The prior on the co-variance matrix elements must be specified A suitablechoice is that the prior on the off-diagonal elements bediffuse and thus able to account for any degree and di-rection of correlation One possible prior for covariancematrices is the Wishart distribution (Wishart 1928)which is a conjugate form Lu (2004) developed a modelof correlated participant and item effects in a related ap-plication Jacobyrsquos (1991) process dissociation modelHe provided conditional posteriors as well as an MCMCimplementation This approach looks promising but sig-nificant development and analysis in this context are stillneeded

Fortunately the present independent-prior model is suf-ficient for asymptotically consistent analysis The advan-tage of the correlated prior discussed above would betwofold First there may be some increase in the accuracyof sensitivity estimates The effect would emerge for ex-treme participants for whom there is a tendency in the independent-prior model to over-shrink estimates Thesecond benefit may be increased ability to assess mirrorand baseline-familiarity effects The present independent-prior model does not give sufficient credence to either ofthese effects hence its estimates of true correlations arebiased toward 0 A correlated prior would lead to more

power in detecting mirror and baseline-familiarity effectsshould they be present

Unequal Variance Signal Detection ModelIn the present model the variances of old- and new-

item familiarity distributions are assumed to be equalThis assumption is fairly standard in many studies Re-searchers who use signal detection in memory do so be-cause it is a quick and principled psychometric model tomeasure sensitivity without a detailed commitment tothe underlying processing Our models are presentedwith this spirit they allow researchers to assess sensitiv-ity without undue influence from item variability

There is some evidence that the equal-variance as-sumption is not appropriate The experimental approachfor testing the equal-variance assumption is to explicitlymanipulate criteria and to trace out the receiver-operatingcharacteristic (Macmillan amp Creelman 1991) RatcliffSheu and Gronlund (1992) followed this approach in arecognition memory experiment and found that the stan-dard deviation of familiarity from the old-item distribu-tion was 8 that of the new-item distribution This resulttaken at face value provides motivation for future devel-opment It seems feasible to construct a model in whichthe variance of the old-item familiarity distribution is afree parameter This may be accomplished by adding avariance parameter to the probit transform of hit rateWithin the context of such a model then the variancemay be simultaneously estimated along with overall sen-sitivity participant effects and item effects

GENERAL DISCUSSION

This article presents an introduction to Bayesian hier-archical models The main goals of this article have beento (1) highlight the dangers of aggregating data in nonlin-ear contexts (2) demonstrate that Bayesian hierarchicalmodels are a feasible approach to modeling participantand item variability simultaneously and (3) implement aBayesian hierarchical signal detection model for recogni-tion memory paradigms Our concluding discussion fo-cuses on a few select issues in Bayesian modeling sub-jectivity of prior distributions model selection pitfallsin Bayesian analysis and the contribution of Bayesiantechniques to the debate about the usefulness of signifi-cance tests

Subjectivity of Prior DistributionsBayesian analysis depends on prior distributions so it

is prudent to wonder whether these distributions add toomuch subjectivity into analysis We believe that priorscan be chosen that enhance analysis without being overlysubjective In many situations it is possible to choosepriors that result in posteriors that exactly match fre-quentist sampling distributions For example Bayesianestimation of the probability of success in a binomialdistribution is the proportion of successes if the prior isa beta with parameters a b 0 A second example is

γ

δj

j

⎝⎜

⎠⎟ ( )ind

~ Bivariate Normal 0 ( ) ΣΣ j

αβ

i

i

⎝⎜⎞

⎠⎟( )ind

~ Bivariate Normal 0 i ΣΣ( )

598 ROUDER AND LU

from the normal distribution The posterior of the popu-lation mean is t distributed with the appropriate degreesof freedom when the prior on the mean is flat and theprior on variance is 1σ 2

Although researchers may choose noninformativepriors it may be prudent in some situations to choosevaguely informative priors There are two general rea-sons to consider an informative prior First vaguely in-formative priors are often more convenient than nonin-formative ones and second they may enhance estimationin those cases in which preexperimental knowledge iswell established We will examine these reasons in turn

There are two ways in which vaguely informative pri-ors are more convenient than noninformative ones Firstthe use of vaguely informative conjugate priors greatlyfacilitates the derivation of posterior conditional distrib-utions This in turn allows for rapid sampling of condi-tionals in the Gibbs sampler in estimating marginal pos-terior distributions Second the vaguely informativepriors here were proper and this propriety guaranteedthe propriety of the posteriors The disadvantage of in-formative priors is the possibility that the choice of theprior will unduly influence the posterior In the applica-tions we presented this did not occur for the chosen pa-rameters of the prior Increasing the variance of the pri-ors above the chosen values has minuscule effects onestimates for reasonably sized data sets

Although we highlight the convenience of vaguely in-formative proper priors researchers may also choose non-imformative priors Noninformative priors however areoften improper The consequences of this choice are that(1) the researcher must prove the propriety of the posteri-ors and (2) MCMC may need to be implemented with thesomewhat more complicated MetropolisndashHastings algo-rithm rather than with Gibbs sampling Neither of these ac-tivities are prohibitive but they do require additional skilland effort For the models presented here vaguely infor-mative conjugate priors were sufficient for valid analysis

The second reason to use informative priors is that insome applications researchers have a reasonable expec-tation of the range of data In the signal detection exam-ple we placed a normal prior on d prime that was centered at0 and had a very wide variance More statistical powercould have been obtained by using a prior with a major-ity of its mass above 0 and below 6 but this increase inpower would have been slight In estimating a responseprobability in a two-choice psychophysics experimentit is advantageous to place a prior with more mass above5 than below it The hierarchical priors advanced in thisarticle can be viewed as a form of prior information Thisis information that people are not arbitrarily dissimilarmdashin other words that true parameters come from orderlyparent distributions (an alternative view however is pre-sented by Lee amp Webb 2005) Judicious use of infor-mative priors can enhance estimation

We argue that the subjectivity in choosing priors is nogreater than other elements of subjectivity in research

On the contrary it may even be less so Researchers rou-tinely make seemingly arbitrary choices during thecourse of design and analysis For example researchersusing response times typically discard those outside anarbitrary window as being too extreme to reflect the pro-cess of interest Participants themselves may be dis-carded for excessive errors or for not following direc-tions The determination of a response-time window anerror-prone participant or a participant not following di-rections imposes a fair degree of subjectivity on analy-sis Within this context the subjectivity of Bayesian pri-ors seems benign

Bayesian analysis may be less subjective because itcan limit the need for other subjective practices Priors ineffect serve as a filter for cleaning data In the presentexample hierarchical priors mitigate the need to censorextremely poor-performing participants The estimatesfrom these participants are adjusted upward because ofthe influence of the prior (see Figure 15) Likewise inour hierarchical response time models (Rouder et al2005 Rouder et al 2003) there is no need to truncatelong response times The impact of these extreme obser-vations is mediated by the prior Hierarchical priors maybe less arbitrary than other selection practices becausethe researcher does not have to specify criterial valuesfor selection

Although the use of priors may be advantageous itdoes not come without risksmdashthe main one being thatthe choice of a prior could unduly influence the outcomeThe straightforward method of assessing this risk is toperform analyses repeatedly with a few different priorsThe posterior will always be affected somewhat by thechoice of prior If this effect is marginal then the result-ing inference can be accepted with greater confidenceThis need for assessing the effects of the prior on theposterior is similar in spirit to the safeguards required infrequentist analysis For example when truncating datathe researcher is obligated to explore the effects of dif-ferent points of truncation in order to ensure that thisfairly arbitrary decision only marginally affects the results

Model SelectionWe have not covered model testing or model selection

in this article One of the negative consequences of usinghierarchical models is that individual parameter esti-mates are no longer independent The value of an esti-mate for any participant depends in part on the valuesof others Accordingly it is invalid to analyze these pa-rameters with ANOVA or ordinary least-squares regres-sion to test hypotheses about groups or experimental manipulations

One proper method of model selection (which includeshypothesis testing) is to compute and compare Bayesfactors This approach has been discussed extensively inthe statistical literature (see Kass amp Raftery 1995 Mengamp Wong 1996) and has been imported to the psycho-

HIERARCHICAL BAYESIAN MODELS 599

logical literature (see eg Pitt Myung amp Zhang 2003)When using a Bayes factor approach one computes theodds of one hypothesis being true relative to another hy-pothesis Unlike estimation computing Bayes factors iscomplicated and a discussion is beyond the scope of thisarticle We anticipate that Bayes factors will play an in-creasing role in inference with nonlinear models

Bayesian PitfallsThere are a few special pitfalls of Bayesian analysis

that deserve mention First there is a great temptation touse improper priors wherever possible The main prob-lem associated with improper priors is that without ana-lytic proof there is no guarantee that the resulting pos-teriors will be proper To complicate the situationMCMC sampling can be done unwittingly from im-proper posterior distributions even though the analysis ismeaningless Researchers choosing improper priorsshould provide evidence that the resulting posterior isproper The propriety of the posterior is not an issue if re-searchers choose proper priors but the use of proper pri-ors is not foolproof either In some situations the vari-ance of the posterior is directly related to the variance ofthe prior without constraint from the data (Hobert ampCasella 1996) This unacceptable situation is an exampleof undue influence of the prior When this happens themodel under consideration is most likely not appropriatefor analysis Another pitfall is that in many hierarchicalsituations MCMC integration converges slowly (seeFigure 18) In these cases researchers who do not checkfor convergence may fail to estimate true posterior dis-tributions In sum although Bayesian approaches areconceptually simple and are easily implemented in high-level packages researchers must approach both the is-sues of choosing a prior and checking the ensuing analy-sis with thoughtfulness and care

Bayesian Analysis and Significance TestsOver the last 40 years researchers have offered vari-

ous critiques of null-hypothesis significance testing (seeeg Cohen 1994 Cumming amp Finch 2001 Hunter1997 Rozeboom 1960 Smithson 2001 Steiger ampFouladi 1997) Some authors offer Bayesian inferenceas an alternative on the basis of its philosophical under-pinnings (Lee amp Wagenmakers 2005 Pruzek 1997Rindskopf 1997) Our rationale for advocating Bayesiananalysis is different We adopt the Bayesian approach outof practicality Simply put we know how to analyzethese nonlinear hierarchical models in the Bayesianframework alone Should there be tremendous advancesin frequentist nonlinear hierarchical models that maketheir implementation more practical than Bayesian oneswe would gladly consider these advances

Unlike those engaged in the significance test debatewe view both Bayesian and classical methods as havingsufficient theoretical grounding These methods are toolswhose applicability depends on the situation In many

experimental situations researchers interested in themean difference between conditions or groups can safelydraw inferences using classical tools such as t testsANOVA and ordinary least-squares regression Thoseworrying about robustness power and control of Type Ierror can increase the validity of analysis by replicationIn many simple cases Bayesian analysis is numericallyequivalent to classical analyses We are not advocatingthat researchers discard useful and convenient tools suchas null-hypothesis significance tests within ANOVA andregression models we are advocating that they considermodeling multiple sources of variability in analysis es-pecially in nonlinear contexts In sum the usefulness ofthe Bayesian approach is its tractability in these morecomplicated settings

REFERENCES

Ahrens J H amp Dieter U (1974) Computer methods for samplingfrom gamma beta Poisson and binomial distributions Computing12 223-246

Ahrens J H amp Dieter U (1982) Generating gamma variates by amodified rejection technique Communications of the Association forComputing Machinery 25 47-54

Albert J amp Chib S (1995) Bayesian residual analysis for binary re-sponse regression models Biometrika 82 747-759

Ashby F G (1992) Multivariate probability distributions In F GAshby (Ed) Multidimensional models of perception and cognition(pp 1-34) Hillsdale NJ Erlbaum

Ashby F G Maddox W T amp Lee W W (1994) On the dangers ofaveraging across subjects when using multidimensional scaling or thesimilarity-choice model Psychological Science 5 144-151

Baayen R H Tweedie F J amp Schreuder R (2002) The subjectsas a simple random effect fallacy Subject variability and morpho-logical family effects in the mental lexicon Brain amp Language 8155-65

Bayes T (1763) An essay towards solving a problem in the doctrineof chances Philosophical Transactions of the Royal Society of Lon-don 53 370-418

Clark H H (1973) The language-as-fixed-effect fallacy A critiqueof language statistics in psychological research Journal of VerbalLearning amp Verbal Behavior 12 335-359

Cohen J (1994) The earth is round ( p 05) American Psychologist49 997-1003

Cumming G amp Finch S (2001) A primer on the understanding useand calculation of confidence intervals that are based on central andnoncentral distributions Educational amp Psychological Measurement61 532-574

Curran T C amp Hintzman D L (1995) Violations of the indepen-dence assumption in process dissociation Journal of ExperimentalPsychology Learning Memory amp Cognition 21 531-547

Egan J P (1975) Signal detection theory and ROC analysis NewYork Academic Press

Einstein G O McDaniel M A amp Lackey S (1989) Bizarre im-agery interference and distinctiveness Journal of Experimental Psy-chology Learning Memory amp Cognition 15 137-146

Estes W K (1956) The problem of inference from curves based ongrouped data Psychological Bulletin 53 134-140

Forster K I amp Dickinson R G (1976) More on the language-as-fixed-effect fallacy Monte Carlo estimates of error rates for F1 F2F prime and min F prime Journal of Verbal Learning amp Verbal Behavior 15135-142

Gelfand A E amp Smith A F M (1990) Sampling-based approachesto calculating marginal densities Journal of the American StatisticalAssociation 85 398-409

600 ROUDER AND LU

Gelman A Carlin J B Stern H S amp Rubin D B (2004)Bayesian data analysis (2nd ed) London Chapman amp Hall

Gelman A amp Rubin D B (1992) Inference from iterative simula-tion using multiple sequences (with discussion) Statistical Science7 457-511

Geman S amp Geman D (1984) Stochastic relaxation Gibbs distri-bution and the Bayesian restoration of images IEEE Transactionson Pattern Analysis amp Machine Intelligence 6 721-741

Geweke J (1992) Evaluating the accuracy of sampling-based ap-proaches to the calculation of posterior moments In J M BernardoJ O Berger A P Dawid amp A F M Smith (Eds) Bayesian statis-tics (Vol 4 pp 169-194) Oxford Oxford University Press Claren-don Press

Gilks W R Richardson S E amp Spiegelhalter D J (1996)Markov chain Monte Carlo in practice London Chapman amp Hall

Gill J (2002) Bayesian methods A social and behavioral sciences ap-proach London Chapman amp Hall

Gilmore G C Hersh H Caramazza A amp Griffin J (1979)Multidimensional letter similarity derived from recognition errorsPerception amp Psychophysics 25 425-431

Glanzer M Adams J K Iverson G J amp Kim K (1993) The reg-ularities of recognition memory Psychological Review 100 546-567

Green D M amp Swets J A (1966) Signal detection theory andpsychophysics New York Wiley

Greenwald A G Draine S C amp Abrams R L (1996) Three cog-nitive markers of unconscious semantic activation Science 2731699-1702

Haider H amp Frensch P A (2002) Why aggregated learning followsthe power law of practice when individual learning does not Com-ment on Rickard (1997 1999) Delaney et al (1998) and Palmeri(1999) Journal of Experimental Psychology Learning Memory ampCognition 28 392-406

Heathcote A Brown S amp Mewhort D J K (2000) The powerlaw repealed The case for an exponential law of practice Psycho-nomic Bulletin amp Review 7 185-207

Hirshman E Whelley M M amp Palij M (1989) An investigationof paradoxical memory effects Journal of Memory amp Language 28594-609

Hobert J P amp Casella G (1996) The effect of improper priors onGibbs sampling in hierarchical linear mixed models Journal of theAmerican Statistical Association 91 1461-1473

Hohle R H (1965) Inferred components of reaction time as a func-tion of foreperiod duration Journal of Experimental Psychology 69382-386

Hunter J E (1997) Needed A ban on the significance test Psycho-logical Science 8 3-7

Jacoby L L (1991) A process dissociation framework Separating au-tomatic from intentional uses of memory Journal of Memory amp Lan-guage 30 513-541

Jeffreys H (1961) Theory of probability (3rd ed) New York OxfordUniversity Press

Kass R E amp Raftery A E (1995) Bayes factors Journal of theAmerican Statistical Association 90 773-795

Kreft I amp de Leeuw J (1998) Introducing multilevel modelingLondon Sage

Lee M D amp Wagenmakers E J (2005) Bayesian statistical infer-ence in psychology Comment on Trafimow (2003) PsychologicalReview 112 662-668

Lee M D amp Webb M R (2005) Modeling individual differences incognition Manuscript submitted for publication

Lu J (2004) Bayesian hierarchical models for process dissociationframework in memory research Unpublished manuscript

Luce R D (1963) Detection and recognition In R D Luce R RBush amp E Galanter (Eds) Handbook of mathematical psychology(Vol 1 pp 103-189) New York Wiley

Macmillan N A amp Creelman C D (1991) Detection theory Auserrsquos guide Cambridge Cambridge University Press

Massaro D W amp Oden G C (1979) Integration of featural infor-mation in speech perception Psychological Review 85 172-191

McClelland J L amp Rumelhart D E (1981) An interactive acti-vation model of context effects in letter perception I An account ofbasic findings Psychological Review 88 375-407

Medin D L amp Schaffer M M (1978) Context theory of classifi-cation learning Psychological Review 85 207-238

Meng X amp Wong W H (1996) Simulating ratios of normalizingconstants via a simple identity A theoretical exploration StatisticaSinica 6 831-860

Nosofsky R M (1986) Attention similarity and the identificationndashcategorization relationship Journal of Experimental PsychologyGeneral 115 39-57

Pitt M A Myung I-J amp Zhang S (2003) Toward a method of se-lecting among computational models of cognition Psychological Re-view 109 472-491

Pra Baldi A de Beni R Cornoldi C amp Cavedon A (1985)Some conditions of the occurrence of the bizarreness effect in recallBritish Journal of Psychology 76 427-436

Pruzek R M (1997) An introduction to Bayesian inference and itsapplications In L Harlow S Mulaik amp J Steiger (Eds) What if there were no significance tests (pp 221-257) Mahwah NJ Erlbaum

Raaijmakers J G W Schrijnemakers J M C amp Gremmen F(1999) How to deal with ldquothe language-as-fixed-effect fallacyrdquoCommon misconceptions and alternative solutions Journal of Mem-ory amp Language 41 416-426

Raftery A E amp Lewis S M (1992) One long run with diagnosticsImplementation strategies for Markov chain Monte Carlo StatisticalScience 7 493-497

Ratcliff R (1978) A theory of memory retrieval Psychological Re-view 85 59-108

Ratcliff R amp Rouder J N (1998) Modeling response times for de-cisions between two choices Psychological Science 9 347-356

Ratcliff R Sheu C-F amp Gronlund S D (1992) Testing globalmemory models using ROC curves Psychological Review 99 518-535

Riefer D M amp Rouder J N (1992) A multinomial modeling analy-sis of the mnemonic benefits of bizarre imagery Memory amp Cogni-tion 20 601-611

Rindskopf R M (1997) Testing ldquosmallrdquo not null hypotheses Clas-sical and Bayesian approaches In L Harlow S Mulaik amp J Steiger(Eds) What if there were no significance tests (pp 221-257) Mah-wah NJ Erlbaum

Roberts G O amp Sahu S K (1997) Updating schemes correlationstructure blocking and parameterization for the Gibbs sampler Jour-nal of the Royal Statistical Society Series B 59 291-317

Rouder J N (2000) Assessing the roles of change discrimination andluminance integration Evidence for a hybrid race model of percep-tual decision making in luminance discrimination Journal of Exper-imental Psychology Human Perception amp Performance 26 359-378

Rouder J N Lu J Speckman P [L] Sun D amp Jiang Y (2005)A hierarchical model for estimating response time distributions Psy-chonomic Bulletin amp Review 12 195-223

Rouder J N Sun D Speckman P L Lu J amp Zhou D (2003) Ahierarchical Bayesian statistical framework for response time distri-butions Psychometrika 68 589-606

Rozeboom W W (1960) The fallacy of the null-hypothesis signifi-cance test Psychological Bulletin 57 416-428

Smithson M (2001) Correct confidence intervals for various regres-sion effect sizes and parameters The importance of noncentral dis-tributions in computing intervals Educational amp Psychological Mea-surement 61 605-632

Steiger J H amp Fouladi R T (1997) Noncentrality interval esti-mation and the evaluation of statistical models In L Harlow S Mu-laik amp J Steiger (Eds) What if there were no significance tests(pp 221-257) Mahwah NJ Erlbaum

Tanner M A (1996) Tools for statistical inference Methods for theexploration of posterior distributions and likelihood functions (3rded) New York Springer

Tierney L (1994) Markov chains for exploring posterior distribu-tions Annals of Statistics 22 1701-1728

HIERARCHICAL BAYESIAN MODELS 601

APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probabil-ity Let xj be 1 if the participant is successful on the jth trial and 0 otherwise For each trial a latent variablewj is defined as follows

(A1)

Latent variables w are never known All that is known is their sign If a trial j was successful it was posi-tive otherwise it was nonpositive Although w is not known its construction is nonetheless essential

Without any loss of generality the random variable wj is assumed to be a normal with a variance of 10 ForEquation A1 to hold the mean of these normals needs be z where z Φ1( p)A1 Therefore

This construction of wj is shown in Figure A1a The parameters of interest are z and w (w1 wJ) thedata are (x1 xJ) Note that if the latent variables w were known the problem of estimating z would besimple Parameter z is simply the population mean of w and could be estimated accordingly The fact that wis latent will not prove to be a stumbling block

The goal is to derive the marginal posterior of z To do so conditional posterior distributions are derivedand then integrated using Gibbs sampling The desired conditionals are z | w and wj | z xj j 1 J Westart with the former Parameter z is the population mean of w hence this case is that of estimating a popu-lation mean from normally distributed observations (w) The prior on z is a normal with parameters μ0 andσ 0

2 so the previous derivation is applicable A direct application of Equations 23 and 24 yields that the con-ditional posterior z | w is a normal with parameters

(A2)

and

(A3)

The conditional wj | z xj is only slightly more complex If xj is unknown latent variable wj is normally dis-tributed and centered at z When wj is conditioned on xj the sign of wj is known If xj 0 then by Equation A1wj must be negative Hence the conditional wj | z xj is a normal density that is truncated above at 0 (Fig-ure A1b) Likewise if xj 1 then wj must be positive and the conditional is a normal truncated below at 0(Figure A1c) The conditional posterior therefore is a truncated normal truncated either above or below de-pending on whether or not the trial was answered successfully

Once conditionals are specified analysis proceeds with Gibbs sampling Sampling from a normal isstraightforward sampling from a truncated normal is not too difficult and an algorithm for such sampling iscoded in Appendix C

σσ

primeminus

= +⎛

⎝⎜

⎠⎟

2

02

1

1N

prime = +⎛

⎝⎜

⎠⎟ +

⎝⎜

⎠⎟

minus

μσ

μσ

N Nw1

02

1

0

02

w zjind~ Normal 1( )

xw

wj

j

j

=ge

lt

⎧⎨⎪

⎩⎪

1 0

0 0

Wickelgren W A (1968) Unidimensional strength theory and com-ponent analysis of noise in absolute and comparative judgmentsJournal of Mathematical Psychology 5 102-122

Wishart J (1928) A generalized product moment distribution in sam-ples from normal multivariate population Biometrika 20 32-52

Wollen K A amp Cox S D (1981) Sentence cueing and the effect ofbizarre imagery Journal of Experimental Psychology Human Learn-ing amp Memory 7 386-392

NOTES

1

where Γ is the generalized factorial function for example Γ(n) (n 1)

2 True probabilities for individuals were obtained by sampling a betadistribution with parameters a 35 and b 15

Be( )( ) ( )( )

a ba ba b

=+

Γ ΓΓ

(Continued on next page)

602 ROUDER AND LU

APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method as follows Letxij be the indicator for the ith person on the jth trial xij 1 if the trial is successful and 0 otherwise Latentvariable wij is constructed as

(B1)

Conditional posterior wij | zi xij is a truncated normal as was discussed in Appendix A Conditional posteriorzi | w μ σ 2 is a normal with mean and variance given in Equations A2 and A3 respectively (with the substi-tution of μ for μ0 and σ 2 for σ 0

2) Conditional posterior densities of μ | σ 2 z and σ 2 | μ z are given in Equa-tions 25 and 28 respectively

xw

wij

ij

ij

=ge

lt

⎧⎨⎪

⎩⎪

1 0

0 0

APPENDIX A (Continued)

Figure A1 The Albert and Chib (1995) construction of a trial-specific latent vari-able for probit transforms (a) Construction of the unconditional latent variable wj forp 84 (b) Conditional distribution of wj on a failure on trial j In this case wj mustbe negative (c) Conditional distribution of wj on a success on trial j In this case wjmust be positive

NOTE TO APPENDIX A

A1 Let μw be the mean of w Then by construction

Random variable wj is a normal with variance of 10 The term Pr(wj 0) is its cumulative distribution function evaluatedat 0 Therefore

Combining this equality with the one above yields

and by symmetry of the normal

μw p z= =minusΦ 1( )

Φ minus( ) = minusμw p1

Pr wj w wlt( ) = minus( )⎡⎣ ⎤⎦ = minus( )0 0 1Φ Φμ μ

Pr Pr w x pj jlt( ) = =( ) = minus0 0 1

HIERARCHICAL BAYESIAN MODELS 603

APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individualsrsquo probabilities R is afreely available easy-to-install open-source statistical package based on SPlus It runs on Windows Macin-tosh and UNIX platforms and can be downloaded from wwwR-projectorg

functions to sample from a truncated normal-------------------------------------------generates n samples of a normal (b1) truncated below zerortnpos=function(nb)u=runif(n)b+qnorm(1-upnorm(b))generates n samples of a normal (b1) truncated above zerortnneg=function(nb)u=runif(n)b+qnorm(upnorm(-b))-----------------------------generate true values and data-----------------------------I=20 Number of participantsJ=50 Number of Trials per participantgenerate each individualrsquos true pp=rbeta(I3515)result use scan() to enter06932272 06956900 06330869 06504598 07008421 06589233 0733730506666958 05867875 07132572 07863823 06869082 06665865 0742691006649086 06212024 07375715 08025261 07587382 07363251generate each individualrsquos observed numbers of successesy=rbinom(IJp)result use scan() to enter34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37----------------------------------------Analysis Set UpM=10000 total number of MCMC iterationsmyrange=101M portion kept burn-in of 100 iterations discardedneeded vectors and matricesz=matrix(ncol=Inrow=M)matrix of each personrsquos z at each iterationnote z is probit of p eg z=qnorm(p)sigma2=1Mmu=1Mprior parameterssigma0=100 this parameter is a variance sigma_0^2mu0=0a=1b=1initial valuessigma2[1]=1mu[1]=1z[1]=rep(1I)shape for inverse gamma calculated outside of loop for speedshape=(I)2+a-----------------------------Main MCMC loop-----------------------------for (m in 2M) iterationfor (i in 1I) participantw=c(rtnpos(y[i]z[m-1i])rtnneg(J-y[i]z[m-1i])) samples of w|yzprec=J+1sigma2[m-1]meanz=(sum(w)+mu[m-1]sigma2[m-1])prec

604 ROUDER AND LU

APPENDIX C (Continued)

z[mi]=rnorm(1meanzsqrt(1prec)) sample of z|wmusigma2prec=Isigma2[m-1]+1sigma0meanmu=(sum(z[m])sigma2[m-1]+mu0sigma0)precmu[m]=rnorm(1meanmusqrt(1prec)) sample of mu|zrate=sum((z[m]-mu[m])^2)2+bsigma2[m]=1rgamma(1shape=shapescale=1rate) sample of sigma2|z-----------------------------------------Gather estimates of p for each individualallpest=pnorm(z[myrange]) MCMC chains of ppesth=apply(allpest2mean) posterior means

(Manuscript received May 25 2004revision accepted for publication January 12 2005)

Page 25: Rouder & Lu (2005)

HIERARCHICAL BAYESIAN MODELS 597

baseline-familiarity effect In sum the hierarchical modeloffers detailed and relatively accurate analysis of partici-pant and item effects in signal detection

FUTURE DEVELOPMENT OF A SIGNALDETECTION MODEL

Expanded PriorsThe simulations reveal that there is some over-shrinkage

in the hierarchical model (see Figures 19 and 21) Wespeculate that the reason this comes about is that theprior assumes independence between αα and ββ and be-tween γγ and δδ This independence may be violated bynegative correlation from mirror effects or by positivecorrelation from criterion effects or baseline-familiarityeffects The prior is neutral in that it does not favor eitherof these effects but it is informative in that it is not con-cordant with either A less informative alternative wouldnot assume independence We seek a prior in which mir-ror effects and baseline-familiarity effects as well as alack thereof are all equally likely

A prior that allows for correlations is given as follows

and

The covariance matrices ΣΣ(i) and ΣΣ( j) allow for arbi-trary correlations of participant and item effects respec-tively on hit and false alarm rates The prior on the co-variance matrix elements must be specified A suitablechoice is that the prior on the off-diagonal elements bediffuse and thus able to account for any degree and di-rection of correlation One possible prior for covariancematrices is the Wishart distribution (Wishart 1928)which is a conjugate form Lu (2004) developed a modelof correlated participant and item effects in a related ap-plication Jacobyrsquos (1991) process dissociation modelHe provided conditional posteriors as well as an MCMCimplementation This approach looks promising but sig-nificant development and analysis in this context are stillneeded

Fortunately the present independent-prior model is suf-ficient for asymptotically consistent analysis The advan-tage of the correlated prior discussed above would betwofold First there may be some increase in the accuracyof sensitivity estimates The effect would emerge for ex-treme participants for whom there is a tendency in the independent-prior model to over-shrink estimates Thesecond benefit may be increased ability to assess mirrorand baseline-familiarity effects The present independent-prior model does not give sufficient credence to either ofthese effects hence its estimates of true correlations arebiased toward 0 A correlated prior would lead to more

power in detecting mirror and baseline-familiarity effectsshould they be present

Unequal Variance Signal Detection ModelIn the present model the variances of old- and new-

item familiarity distributions are assumed to be equalThis assumption is fairly standard in many studies Re-searchers who use signal detection in memory do so be-cause it is a quick and principled psychometric model tomeasure sensitivity without a detailed commitment tothe underlying processing Our models are presentedwith this spirit they allow researchers to assess sensitiv-ity without undue influence from item variability

There is some evidence that the equal-variance as-sumption is not appropriate The experimental approachfor testing the equal-variance assumption is to explicitlymanipulate criteria and to trace out the receiver-operatingcharacteristic (Macmillan amp Creelman 1991) RatcliffSheu and Gronlund (1992) followed this approach in arecognition memory experiment and found that the stan-dard deviation of familiarity from the old-item distribu-tion was 8 that of the new-item distribution This resulttaken at face value provides motivation for future devel-opment It seems feasible to construct a model in whichthe variance of the old-item familiarity distribution is afree parameter This may be accomplished by adding avariance parameter to the probit transform of hit rateWithin the context of such a model then the variancemay be simultaneously estimated along with overall sen-sitivity participant effects and item effects

GENERAL DISCUSSION

This article presents an introduction to Bayesian hier-archical models The main goals of this article have beento (1) highlight the dangers of aggregating data in nonlin-ear contexts (2) demonstrate that Bayesian hierarchicalmodels are a feasible approach to modeling participantand item variability simultaneously and (3) implement aBayesian hierarchical signal detection model for recogni-tion memory paradigms Our concluding discussion fo-cuses on a few select issues in Bayesian modeling sub-jectivity of prior distributions model selection pitfallsin Bayesian analysis and the contribution of Bayesiantechniques to the debate about the usefulness of signifi-cance tests

Subjectivity of Prior DistributionsBayesian analysis depends on prior distributions so it

is prudent to wonder whether these distributions add toomuch subjectivity into analysis We believe that priorscan be chosen that enhance analysis without being overlysubjective In many situations it is possible to choosepriors that result in posteriors that exactly match fre-quentist sampling distributions For example Bayesianestimation of the probability of success in a binomialdistribution is the proportion of successes if the prior isa beta with parameters a b 0 A second example is

γ

δj

j

⎝⎜

⎠⎟ ( )ind

~ Bivariate Normal 0 ( ) ΣΣ j

αβ

i

i

⎝⎜⎞

⎠⎟( )ind

~ Bivariate Normal 0 i ΣΣ( )

598 ROUDER AND LU

from the normal distribution The posterior of the popu-lation mean is t distributed with the appropriate degreesof freedom when the prior on the mean is flat and theprior on variance is 1σ 2

Although researchers may choose noninformativepriors it may be prudent in some situations to choosevaguely informative priors There are two general rea-sons to consider an informative prior First vaguely in-formative priors are often more convenient than nonin-formative ones and second they may enhance estimationin those cases in which preexperimental knowledge iswell established We will examine these reasons in turn

There are two ways in which vaguely informative pri-ors are more convenient than noninformative ones Firstthe use of vaguely informative conjugate priors greatlyfacilitates the derivation of posterior conditional distrib-utions This in turn allows for rapid sampling of condi-tionals in the Gibbs sampler in estimating marginal pos-terior distributions Second the vaguely informativepriors here were proper and this propriety guaranteedthe propriety of the posteriors The disadvantage of in-formative priors is the possibility that the choice of theprior will unduly influence the posterior In the applica-tions we presented this did not occur for the chosen pa-rameters of the prior Increasing the variance of the pri-ors above the chosen values has minuscule effects onestimates for reasonably sized data sets

Although we highlight the convenience of vaguely in-formative proper priors researchers may also choose non-imformative priors Noninformative priors however areoften improper The consequences of this choice are that(1) the researcher must prove the propriety of the posteri-ors and (2) MCMC may need to be implemented with thesomewhat more complicated MetropolisndashHastings algo-rithm rather than with Gibbs sampling Neither of these ac-tivities are prohibitive but they do require additional skilland effort For the models presented here vaguely infor-mative conjugate priors were sufficient for valid analysis

The second reason to use informative priors is that insome applications researchers have a reasonable expec-tation of the range of data In the signal detection exam-ple we placed a normal prior on d prime that was centered at0 and had a very wide variance More statistical powercould have been obtained by using a prior with a major-ity of its mass above 0 and below 6 but this increase inpower would have been slight In estimating a responseprobability in a two-choice psychophysics experimentit is advantageous to place a prior with more mass above5 than below it The hierarchical priors advanced in thisarticle can be viewed as a form of prior information Thisis information that people are not arbitrarily dissimilarmdashin other words that true parameters come from orderlyparent distributions (an alternative view however is pre-sented by Lee amp Webb 2005) Judicious use of infor-mative priors can enhance estimation

We argue that the subjectivity in choosing priors is nogreater than other elements of subjectivity in research

On the contrary it may even be less so Researchers rou-tinely make seemingly arbitrary choices during thecourse of design and analysis For example researchersusing response times typically discard those outside anarbitrary window as being too extreme to reflect the pro-cess of interest Participants themselves may be dis-carded for excessive errors or for not following direc-tions The determination of a response-time window anerror-prone participant or a participant not following di-rections imposes a fair degree of subjectivity on analy-sis Within this context the subjectivity of Bayesian pri-ors seems benign

Bayesian analysis may be less subjective because itcan limit the need for other subjective practices Priors ineffect serve as a filter for cleaning data In the presentexample hierarchical priors mitigate the need to censorextremely poor-performing participants The estimatesfrom these participants are adjusted upward because ofthe influence of the prior (see Figure 15) Likewise inour hierarchical response time models (Rouder et al2005 Rouder et al 2003) there is no need to truncatelong response times The impact of these extreme obser-vations is mediated by the prior Hierarchical priors maybe less arbitrary than other selection practices becausethe researcher does not have to specify criterial valuesfor selection

Although the use of priors may be advantageous itdoes not come without risksmdashthe main one being thatthe choice of a prior could unduly influence the outcomeThe straightforward method of assessing this risk is toperform analyses repeatedly with a few different priorsThe posterior will always be affected somewhat by thechoice of prior If this effect is marginal then the result-ing inference can be accepted with greater confidenceThis need for assessing the effects of the prior on theposterior is similar in spirit to the safeguards required infrequentist analysis For example when truncating datathe researcher is obligated to explore the effects of dif-ferent points of truncation in order to ensure that thisfairly arbitrary decision only marginally affects the results

Model SelectionWe have not covered model testing or model selection

in this article One of the negative consequences of usinghierarchical models is that individual parameter esti-mates are no longer independent The value of an esti-mate for any participant depends in part on the valuesof others Accordingly it is invalid to analyze these pa-rameters with ANOVA or ordinary least-squares regres-sion to test hypotheses about groups or experimental manipulations

One proper method of model selection (which includeshypothesis testing) is to compute and compare Bayesfactors This approach has been discussed extensively inthe statistical literature (see Kass amp Raftery 1995 Mengamp Wong 1996) and has been imported to the psycho-

HIERARCHICAL BAYESIAN MODELS 599

logical literature (see eg Pitt Myung amp Zhang 2003)When using a Bayes factor approach one computes theodds of one hypothesis being true relative to another hy-pothesis Unlike estimation computing Bayes factors iscomplicated and a discussion is beyond the scope of thisarticle We anticipate that Bayes factors will play an in-creasing role in inference with nonlinear models

Bayesian PitfallsThere are a few special pitfalls of Bayesian analysis

that deserve mention First there is a great temptation touse improper priors wherever possible The main prob-lem associated with improper priors is that without ana-lytic proof there is no guarantee that the resulting pos-teriors will be proper To complicate the situationMCMC sampling can be done unwittingly from im-proper posterior distributions even though the analysis ismeaningless Researchers choosing improper priorsshould provide evidence that the resulting posterior isproper The propriety of the posterior is not an issue if re-searchers choose proper priors but the use of proper pri-ors is not foolproof either In some situations the vari-ance of the posterior is directly related to the variance ofthe prior without constraint from the data (Hobert ampCasella 1996) This unacceptable situation is an exampleof undue influence of the prior When this happens themodel under consideration is most likely not appropriatefor analysis Another pitfall is that in many hierarchicalsituations MCMC integration converges slowly (seeFigure 18) In these cases researchers who do not checkfor convergence may fail to estimate true posterior dis-tributions In sum although Bayesian approaches areconceptually simple and are easily implemented in high-level packages researchers must approach both the is-sues of choosing a prior and checking the ensuing analy-sis with thoughtfulness and care

Bayesian Analysis and Significance TestsOver the last 40 years researchers have offered vari-

ous critiques of null-hypothesis significance testing (seeeg Cohen 1994 Cumming amp Finch 2001 Hunter1997 Rozeboom 1960 Smithson 2001 Steiger ampFouladi 1997) Some authors offer Bayesian inferenceas an alternative on the basis of its philosophical under-pinnings (Lee amp Wagenmakers 2005 Pruzek 1997Rindskopf 1997) Our rationale for advocating Bayesiananalysis is different We adopt the Bayesian approach outof practicality Simply put we know how to analyzethese nonlinear hierarchical models in the Bayesianframework alone Should there be tremendous advancesin frequentist nonlinear hierarchical models that maketheir implementation more practical than Bayesian oneswe would gladly consider these advances

Unlike those engaged in the significance test debatewe view both Bayesian and classical methods as havingsufficient theoretical grounding These methods are toolswhose applicability depends on the situation In many

experimental situations researchers interested in themean difference between conditions or groups can safelydraw inferences using classical tools such as t testsANOVA and ordinary least-squares regression Thoseworrying about robustness power and control of Type Ierror can increase the validity of analysis by replicationIn many simple cases Bayesian analysis is numericallyequivalent to classical analyses We are not advocatingthat researchers discard useful and convenient tools suchas null-hypothesis significance tests within ANOVA andregression models we are advocating that they considermodeling multiple sources of variability in analysis es-pecially in nonlinear contexts In sum the usefulness ofthe Bayesian approach is its tractability in these morecomplicated settings

REFERENCES

Ahrens, J. H., & Dieter, U. (1974). Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing, 12, 223-246.
Ahrens, J. H., & Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the Association for Computing Machinery, 25, 47-54.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747-759.
Ashby, F. G. (1992). Multivariate probability distributions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 1-34). Hillsdale, NJ: Erlbaum.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Baayen, R. H., Tweedie, F. J., & Schreuder, R. (2002). The subjects as a simple random effect fallacy: Subject variability and morphological family effects in the mental lexicon. Brain & Language, 81, 55-65.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning & Verbal Behavior, 12, 335-359.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational & Psychological Measurement, 61, 532-574.
Curran, T. C., & Hintzman, D. L. (1995). Violations of the independence assumption in process dissociation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 531-547.
Egan, J. P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Einstein, G. O., McDaniel, M. A., & Lackey, S. (1989). Bizarre imagery, interference, and distinctiveness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 15, 137-146.
Estes, W. K. (1956). The problem of inference from curves based on grouped data. Psychological Bulletin, 53, 134-140.
Forster, K. I., & Dickinson, R. G. (1976). More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F′, and min F′. Journal of Verbal Learning & Verbal Behavior, 15, 135-142.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457-511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 6, 721-741.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 4, pp. 169-194). Oxford: Oxford University Press, Clarendon Press.
Gilks, W. R., Richardson, S. E., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. London: Chapman & Hall.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546-567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Greenwald, A. G., Draine, S. C., & Abrams, R. L. (1996). Three cognitive markers of unconscious semantic activation. Science, 273, 1699-1702.
Haider, H., & Frensch, P. A. (2002). Why aggregated learning follows the power law of practice when individual learning does not: Comment on Rickard (1997, 1999), Delaney et al. (1998), and Palmeri (1999). Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 392-406.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185-207.
Hirshman, E., Whelley, M. M., & Palij, M. (1989). An investigation of paradoxical memory effects. Journal of Memory & Language, 28, 594-609.
Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-1473.
Hohle, R. H. (1965). Inferred components of reaction time as a function of foreperiod duration. Journal of Experimental Psychology, 69, 382-386.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacoby, L. L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662-668.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Manuscript submitted for publication.
Lu, J. (2004). Bayesian hierarchical models for process dissociation framework in memory research. Unpublished manuscript.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103-189). New York: Wiley.
Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
Massaro, D. W., & Oden, G. C. (1979). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88, 375-407.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831-860.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Pitt, M. A., Myung, I.-J., & Zhang, S. (2003). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472-491.
Pra Baldi, A., de Beni, R., Cornoldi, C., & Cavedon, A. (1985). Some conditions of the occurrence of the bizarreness effect in recall. British Journal of Psychology, 76, 427-436.
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "the language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.
Raftery, A. E., & Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7, 493-497.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59-108.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for decisions between two choices. Psychological Science, 9, 347-356.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518-535.
Riefer, D. M., & Rouder, J. N. (1992). A multinomial modeling analysis of the mnemonic benefits of bizarre imagery. Memory & Cognition, 20, 601-611.
Rindskopf, R. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Roberts, G. O., & Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59, 291-317.
Rouder, J. N. (2000). Assessing the roles of change discrimination and luminance integration: Evidence for a hybrid race model of perceptual decision making in luminance discrimination. Journal of Experimental Psychology: Human Perception & Performance, 26, 359-378.
Rouder, J. N., Lu, J., Speckman, P. [L.], Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12, 195-223.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589-606.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational & Psychological Measurement, 61, 605-632.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Tanner, M. A. (1996). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions (3rd ed.). New York: Springer.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1728.
Wickelgren, W. A. (1968). Unidimensional strength theory and component analysis of noise in absolute and comparative judgments. Journal of Mathematical Psychology, 5, 102-122.
Wishart, J. (1928). A generalized product moment distribution in samples from normal multivariate population. Biometrika, 20, 32-52.
Wollen, K. A., & Cox, S. D. (1981). Sentence cueing and the effect of bizarre imagery. Journal of Experimental Psychology: Human Learning & Memory, 7, 386-392.


APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probability. Let $x_j$ be 1 if the participant is successful on the jth trial and 0 otherwise. For each trial, a latent variable $w_j$ is defined as follows:

$$x_j = \begin{cases} 1, & w_j \ge 0, \\ 0, & w_j < 0. \end{cases} \quad\text{(A1)}$$

Latent variables w are never known. All that is known is their sign: If a trial j was successful, it was positive; otherwise, it was nonpositive. Although w is not known, its construction is nonetheless essential.

Without any loss of generality, the random variable $w_j$ is assumed to be a normal with a variance of 1.0. For Equation A1 to hold, the mean of these normals needs to be z, where $z = \Phi^{-1}(p)$ (see Note A1). Therefore,

$$w_j \overset{\mathrm{ind}}{\sim} \text{Normal}(z, 1).$$

This construction of $w_j$ is shown in Figure A1a. The parameters of interest are z and $\mathbf{w} = (w_1, \ldots, w_J)$; the data are $(x_1, \ldots, x_J)$. Note that if the latent variables w were known, the problem of estimating z would be simple: Parameter z is simply the population mean of w and could be estimated accordingly. The fact that w is latent will not prove to be a stumbling block.

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are $z \mid \mathbf{w}$ and $w_j \mid z, x_j$, $j = 1, \ldots, J$. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters $\mu_0$ and $\sigma_0^2$, so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior $z \mid \mathbf{w}$ is a normal with parameters

$$\mu' = \left(N + \frac{1}{\sigma_0^2}\right)^{-1}\left(N\bar{w} + \frac{\mu_0}{\sigma_0^2}\right) \quad\text{(A2)}$$

and

$$\sigma'^2 = \left(N + \frac{1}{\sigma_0^2}\right)^{-1}, \quad\text{(A3)}$$

where N is the number of observations (here, N = J) and $\bar{w}$ is their mean.

The conditional $w_j \mid z, x_j$ is only slightly more complex. If $x_j$ is unknown, latent variable $w_j$ is normally distributed and centered at z. When $w_j$ is conditioned on $x_j$, the sign of $w_j$ is known: If $x_j = 0$, then, by Equation A1, $w_j$ must be negative. Hence, the conditional $w_j \mid z, x_j$ is a normal density that is truncated above at 0 (Figure A1b). Likewise, if $x_j = 1$, then $w_j$ must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below, depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.

[Figure A1. The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms. (a) Construction of the unconditional latent variable $w_j$ for p = .84. (b) Conditional distribution of $w_j$ on a failure on trial j; in this case, $w_j$ must be negative. (c) Conditional distribution of $w_j$ on a success on trial j; in this case, $w_j$ must be positive.]

NOTE TO APPENDIX A

A1. Let $\mu_w$ be the mean of w. Then, by construction,

$$\Pr(w_j < 0) = \Pr(x_j = 0) = 1 - p.$$

Random variable $w_j$ is a normal with variance of 1.0, so the term $\Pr(w_j < 0)$ is its cumulative distribution function evaluated at 0. Therefore,

$$\Pr(w_j < 0) = \Phi\left[\left(0 - \mu_w\right)/1\right] = \Phi(-\mu_w).$$

Combining this equality with the one above yields

$$\Phi(-\mu_w) = 1 - p,$$

and, by symmetry of the normal,

$$\mu_w = \Phi^{-1}(p) = z.$$
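To make the sampler concrete, the following minimal sketch (ours, not part of the original appendix) implements the two conditional draws just described for a single probability. It reuses the truncated-normal samplers rtnpos() and rtnneg() defined in Appendix C; the function name onep.gibbs and its defaults are our own choices.

# Gibbs sampler for one probability: y successes in J trials,
# normal prior on z with mean mu0 and variance sigma0
onep.gibbs=function(y,J,M=5000,mu0=0,sigma0=100) {
  z=numeric(M)  # chain for z = qnorm(p); starts at z[1]=0
  for (m in 2:M) {
    # w | z, x: positive truncated draws for successes, negative for failures
    w=c(rtnpos(y,z[m-1]),rtnneg(J-y,z[m-1]))
    # z | w: normal with Equations A2 and A3 (N = J)
    prec=J+1/sigma0
    meanz=(sum(w)+mu0/sigma0)/prec
    z[m]=rnorm(1,meanz,sqrt(1/prec))
  }
  pnorm(z[101:M])  # chain of p = pnorm(z), burn-in of 100 discarded
}
# example: mean(onep.gibbs(34,50)) estimates p for 34 successes in 50 trials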


NOTES

1. $\text{Be}(a, b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$, where $\Gamma$ is the generalized factorial function; for example, $\Gamma(n) = (n - 1)!$.

2. True probabilities for individuals were obtained by sampling a beta distribution with parameters a = 3.5 and b = 1.5.


APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let $x_{ij}$ be the indicator for the ith person on the jth trial: $x_{ij} = 1$ if the trial is successful and 0 otherwise. Latent variable $w_{ij}$ is constructed as

$$x_{ij} = \begin{cases} 1, & w_{ij} \ge 0, \\ 0, & w_{ij} < 0. \end{cases} \quad\text{(B1)}$$

Conditional posterior $w_{ij} \mid z_i, x_{ij}$ is a truncated normal, as was discussed in Appendix A. Conditional posterior $z_i \mid \mathbf{w}, \mu, \sigma^2$ is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of $\mu$ for $\mu_0$ and $\sigma^2$ for $\sigma_0^2$). Conditional posterior densities of $\mu \mid \sigma^2, \mathbf{z}$ and $\sigma^2 \mid \mu, \mathbf{z}$ are given in Equations 25 and 28, respectively.


APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-Plus. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

# functions to sample from a truncated normal
# -------------------------------------------
# generates n samples of a normal(b,1) truncated below zero
rtnpos=function(n,b) {u=runif(n); b+qnorm(1-u*pnorm(b))}
# generates n samples of a normal(b,1) truncated above zero
rtnneg=function(n,b) {u=runif(n); b+qnorm(u*pnorm(-b))}

# -----------------------------
# generate true values and data
# -----------------------------
I=20  # Number of participants
J=50  # Number of Trials per participant
# generate each individual's true p
p=rbeta(I,3.5,1.5)
# result, use scan() to enter:
# 0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
# 0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
# 0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
# generate each individual's observed numbers of successes
y=rbinom(I,J,p)
# result, use scan() to enter:
# 34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37

# ---------------
# Analysis set up
# ---------------
M=10000        # total number of MCMC iterations
myrange=101:M  # portion kept; burn-in of 100 iterations discarded
# needed vectors and matrices
z=matrix(ncol=I,nrow=M)  # matrix of each person's z at each iteration
                         # note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
# prior parameters
sigma0=100  # this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
# initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
# shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a

# --------------
# Main MCMC loop
# --------------
for (m in 2:M)    # iteration
{
  for (i in 1:I)  # participant
  {
    w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i]))  # samples of w|y,z
    prec=J+1/sigma2[m-1]
    meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
    z[m,i]=rnorm(1,meanz,sqrt(1/prec))  # sample of z|w,mu,sigma2
  }
  prec=I/sigma2[m-1]+1/sigma0
  meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
  mu[m]=rnorm(1,meanmu,sqrt(1/prec))  # sample of mu|z
  rate=sum((z[m,]-mu[m])^2)/2+b
  sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate)  # sample of sigma2|mu,z
}

# -----------------------------------------
# Gather estimates of p for each individual
# -----------------------------------------
allpest=pnorm(z[myrange,])   # MCMC chains of p
pesth=apply(allpest,2,mean)  # posterior means
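As a quick illustrative check of the hierarchical shrinkage discussed in the text (our addition, assuming the objects y, J, and pesth from the code above are in the workspace), one can compare the raw sample proportions with the posterior means:

phat=y/J                     # raw sample proportions
round(cbind(phat,pesth),3)   # side-by-side comparison
plot(phat,pesth,xlab="sample proportion",ylab="posterior mean of p")
abline(0,1)                  # posterior means are pulled toward the group mean

Extreme sample proportions are pulled toward the middle of the group, which is the shrinkage behavior induced by the hierarchical prior.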

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)



Page 27: Rouder & Lu (2005)

HIERARCHICAL BAYESIAN MODELS 599

logical literature (see eg Pitt Myung amp Zhang 2003)When using a Bayes factor approach one computes theodds of one hypothesis being true relative to another hy-pothesis Unlike estimation computing Bayes factors iscomplicated and a discussion is beyond the scope of thisarticle We anticipate that Bayes factors will play an in-creasing role in inference with nonlinear models

Bayesian PitfallsThere are a few special pitfalls of Bayesian analysis

that deserve mention First there is a great temptation touse improper priors wherever possible The main prob-lem associated with improper priors is that without ana-lytic proof there is no guarantee that the resulting pos-teriors will be proper To complicate the situationMCMC sampling can be done unwittingly from im-proper posterior distributions even though the analysis ismeaningless Researchers choosing improper priorsshould provide evidence that the resulting posterior isproper The propriety of the posterior is not an issue if re-searchers choose proper priors but the use of proper pri-ors is not foolproof either In some situations the vari-ance of the posterior is directly related to the variance ofthe prior without constraint from the data (Hobert ampCasella 1996) This unacceptable situation is an exampleof undue influence of the prior When this happens themodel under consideration is most likely not appropriatefor analysis Another pitfall is that in many hierarchicalsituations MCMC integration converges slowly (seeFigure 18) In these cases researchers who do not checkfor convergence may fail to estimate true posterior dis-tributions In sum although Bayesian approaches areconceptually simple and are easily implemented in high-level packages researchers must approach both the is-sues of choosing a prior and checking the ensuing analy-sis with thoughtfulness and care

Bayesian Analysis and Significance TestsOver the last 40 years researchers have offered vari-

ous critiques of null-hypothesis significance testing (seeeg Cohen 1994 Cumming amp Finch 2001 Hunter1997 Rozeboom 1960 Smithson 2001 Steiger ampFouladi 1997) Some authors offer Bayesian inferenceas an alternative on the basis of its philosophical under-pinnings (Lee amp Wagenmakers 2005 Pruzek 1997Rindskopf 1997) Our rationale for advocating Bayesiananalysis is different We adopt the Bayesian approach outof practicality Simply put we know how to analyzethese nonlinear hierarchical models in the Bayesianframework alone Should there be tremendous advancesin frequentist nonlinear hierarchical models that maketheir implementation more practical than Bayesian oneswe would gladly consider these advances

Unlike those engaged in the significance test debatewe view both Bayesian and classical methods as havingsufficient theoretical grounding These methods are toolswhose applicability depends on the situation In many

experimental situations researchers interested in themean difference between conditions or groups can safelydraw inferences using classical tools such as t testsANOVA and ordinary least-squares regression Thoseworrying about robustness power and control of Type Ierror can increase the validity of analysis by replicationIn many simple cases Bayesian analysis is numericallyequivalent to classical analyses We are not advocatingthat researchers discard useful and convenient tools suchas null-hypothesis significance tests within ANOVA andregression models we are advocating that they considermodeling multiple sources of variability in analysis es-pecially in nonlinear contexts In sum the usefulness ofthe Bayesian approach is its tractability in these morecomplicated settings

REFERENCES

Ahrens J H amp Dieter U (1974) Computer methods for samplingfrom gamma beta Poisson and binomial distributions Computing12 223-246

Ahrens J H amp Dieter U (1982) Generating gamma variates by amodified rejection technique Communications of the Association forComputing Machinery 25 47-54

Albert J amp Chib S (1995) Bayesian residual analysis for binary re-sponse regression models Biometrika 82 747-759

Ashby F G (1992) Multivariate probability distributions In F GAshby (Ed) Multidimensional models of perception and cognition(pp 1-34) Hillsdale NJ Erlbaum

Ashby F G Maddox W T amp Lee W W (1994) On the dangers ofaveraging across subjects when using multidimensional scaling or thesimilarity-choice model Psychological Science 5 144-151

Baayen R H Tweedie F J amp Schreuder R (2002) The subjectsas a simple random effect fallacy Subject variability and morpho-logical family effects in the mental lexicon Brain amp Language 8155-65

Bayes T (1763) An essay towards solving a problem in the doctrineof chances Philosophical Transactions of the Royal Society of Lon-don 53 370-418

Clark H H (1973) The language-as-fixed-effect fallacy A critiqueof language statistics in psychological research Journal of VerbalLearning amp Verbal Behavior 12 335-359

Cohen J (1994) The earth is round ( p 05) American Psychologist49 997-1003

Cumming G amp Finch S (2001) A primer on the understanding useand calculation of confidence intervals that are based on central andnoncentral distributions Educational amp Psychological Measurement61 532-574

Curran T C amp Hintzman D L (1995) Violations of the indepen-dence assumption in process dissociation Journal of ExperimentalPsychology Learning Memory amp Cognition 21 531-547

Egan J P (1975) Signal detection theory and ROC analysis NewYork Academic Press

Einstein G O McDaniel M A amp Lackey S (1989) Bizarre im-agery interference and distinctiveness Journal of Experimental Psy-chology Learning Memory amp Cognition 15 137-146

Estes W K (1956) The problem of inference from curves based ongrouped data Psychological Bulletin 53 134-140

Forster K I amp Dickinson R G (1976) More on the language-as-fixed-effect fallacy Monte Carlo estimates of error rates for F1 F2F prime and min F prime Journal of Verbal Learning amp Verbal Behavior 15135-142

Gelfand A E amp Smith A F M (1990) Sampling-based approachesto calculating marginal densities Journal of the American StatisticalAssociation 85 398-409

600 ROUDER AND LU

Gelman A Carlin J B Stern H S amp Rubin D B (2004)Bayesian data analysis (2nd ed) London Chapman amp Hall

Gelman A amp Rubin D B (1992) Inference from iterative simula-tion using multiple sequences (with discussion) Statistical Science7 457-511

Geman S amp Geman D (1984) Stochastic relaxation Gibbs distri-bution and the Bayesian restoration of images IEEE Transactionson Pattern Analysis amp Machine Intelligence 6 721-741

Geweke J (1992) Evaluating the accuracy of sampling-based ap-proaches to the calculation of posterior moments In J M BernardoJ O Berger A P Dawid amp A F M Smith (Eds) Bayesian statis-tics (Vol 4 pp 169-194) Oxford Oxford University Press Claren-don Press

Gilks W R Richardson S E amp Spiegelhalter D J (1996)Markov chain Monte Carlo in practice London Chapman amp Hall

Gill J (2002) Bayesian methods A social and behavioral sciences ap-proach London Chapman amp Hall

Gilmore G C Hersh H Caramazza A amp Griffin J (1979)Multidimensional letter similarity derived from recognition errorsPerception amp Psychophysics 25 425-431

Glanzer M Adams J K Iverson G J amp Kim K (1993) The reg-ularities of recognition memory Psychological Review 100 546-567

Green D M amp Swets J A (1966) Signal detection theory andpsychophysics New York Wiley

Greenwald A G Draine S C amp Abrams R L (1996) Three cog-nitive markers of unconscious semantic activation Science 2731699-1702

Haider H amp Frensch P A (2002) Why aggregated learning followsthe power law of practice when individual learning does not Com-ment on Rickard (1997 1999) Delaney et al (1998) and Palmeri(1999) Journal of Experimental Psychology Learning Memory ampCognition 28 392-406

Heathcote A Brown S amp Mewhort D J K (2000) The powerlaw repealed The case for an exponential law of practice Psycho-nomic Bulletin amp Review 7 185-207

Hirshman E Whelley M M amp Palij M (1989) An investigationof paradoxical memory effects Journal of Memory amp Language 28594-609

Hobert J P amp Casella G (1996) The effect of improper priors onGibbs sampling in hierarchical linear mixed models Journal of theAmerican Statistical Association 91 1461-1473

Hohle R H (1965) Inferred components of reaction time as a func-tion of foreperiod duration Journal of Experimental Psychology 69382-386

Hunter J E (1997) Needed A ban on the significance test Psycho-logical Science 8 3-7

Jacoby L L (1991) A process dissociation framework Separating au-tomatic from intentional uses of memory Journal of Memory amp Lan-guage 30 513-541

Jeffreys H (1961) Theory of probability (3rd ed) New York OxfordUniversity Press

Kass R E amp Raftery A E (1995) Bayes factors Journal of theAmerican Statistical Association 90 773-795

Kreft I amp de Leeuw J (1998) Introducing multilevel modelingLondon Sage

Lee M D amp Wagenmakers E J (2005) Bayesian statistical infer-ence in psychology Comment on Trafimow (2003) PsychologicalReview 112 662-668

Lee M D amp Webb M R (2005) Modeling individual differences incognition Manuscript submitted for publication

Lu J (2004) Bayesian hierarchical models for process dissociationframework in memory research Unpublished manuscript

Luce R D (1963) Detection and recognition In R D Luce R RBush amp E Galanter (Eds) Handbook of mathematical psychology(Vol 1 pp 103-189) New York Wiley

Macmillan N A amp Creelman C D (1991) Detection theory Auserrsquos guide Cambridge Cambridge University Press

Massaro D W amp Oden G C (1979) Integration of featural infor-mation in speech perception Psychological Review 85 172-191

McClelland J L amp Rumelhart D E (1981) An interactive acti-vation model of context effects in letter perception I An account ofbasic findings Psychological Review 88 375-407

Medin D L amp Schaffer M M (1978) Context theory of classifi-cation learning Psychological Review 85 207-238

Meng X amp Wong W H (1996) Simulating ratios of normalizingconstants via a simple identity A theoretical exploration StatisticaSinica 6 831-860

Nosofsky R M (1986) Attention similarity and the identificationndashcategorization relationship Journal of Experimental PsychologyGeneral 115 39-57

Pitt M A Myung I-J amp Zhang S (2003) Toward a method of se-lecting among computational models of cognition Psychological Re-view 109 472-491

Pra Baldi A de Beni R Cornoldi C amp Cavedon A (1985)Some conditions of the occurrence of the bizarreness effect in recallBritish Journal of Psychology 76 427-436

Pruzek R M (1997) An introduction to Bayesian inference and itsapplications In L Harlow S Mulaik amp J Steiger (Eds) What if there were no significance tests (pp 221-257) Mahwah NJ Erlbaum

Raaijmakers J G W Schrijnemakers J M C amp Gremmen F(1999) How to deal with ldquothe language-as-fixed-effect fallacyrdquoCommon misconceptions and alternative solutions Journal of Mem-ory amp Language 41 416-426

Raftery A E amp Lewis S M (1992) One long run with diagnosticsImplementation strategies for Markov chain Monte Carlo StatisticalScience 7 493-497

Ratcliff R (1978) A theory of memory retrieval Psychological Re-view 85 59-108

Ratcliff R amp Rouder J N (1998) Modeling response times for de-cisions between two choices Psychological Science 9 347-356

Ratcliff R Sheu C-F amp Gronlund S D (1992) Testing globalmemory models using ROC curves Psychological Review 99 518-535

Riefer D M amp Rouder J N (1992) A multinomial modeling analy-sis of the mnemonic benefits of bizarre imagery Memory amp Cogni-tion 20 601-611

Rindskopf R M (1997) Testing ldquosmallrdquo not null hypotheses Clas-sical and Bayesian approaches In L Harlow S Mulaik amp J Steiger(Eds) What if there were no significance tests (pp 221-257) Mah-wah NJ Erlbaum

Roberts G O amp Sahu S K (1997) Updating schemes correlationstructure blocking and parameterization for the Gibbs sampler Jour-nal of the Royal Statistical Society Series B 59 291-317

Rouder J N (2000) Assessing the roles of change discrimination andluminance integration Evidence for a hybrid race model of percep-tual decision making in luminance discrimination Journal of Exper-imental Psychology Human Perception amp Performance 26 359-378

Rouder J N Lu J Speckman P [L] Sun D amp Jiang Y (2005)A hierarchical model for estimating response time distributions Psy-chonomic Bulletin amp Review 12 195-223

Rouder J N Sun D Speckman P L Lu J amp Zhou D (2003) Ahierarchical Bayesian statistical framework for response time distri-butions Psychometrika 68 589-606

Rozeboom W W (1960) The fallacy of the null-hypothesis signifi-cance test Psychological Bulletin 57 416-428

Smithson M (2001) Correct confidence intervals for various regres-sion effect sizes and parameters The importance of noncentral dis-tributions in computing intervals Educational amp Psychological Mea-surement 61 605-632

Steiger J H amp Fouladi R T (1997) Noncentrality interval esti-mation and the evaluation of statistical models In L Harlow S Mu-laik amp J Steiger (Eds) What if there were no significance tests(pp 221-257) Mahwah NJ Erlbaum

Tanner M A (1996) Tools for statistical inference Methods for theexploration of posterior distributions and likelihood functions (3rded) New York Springer

Tierney L (1994) Markov chains for exploring posterior distribu-tions Annals of Statistics 22 1701-1728

HIERARCHICAL BAYESIAN MODELS 601

APPENDIX A

This appendix describes the Albert and Chib (1995) method for estimating a probit-transformed probabil-ity Let xj be 1 if the participant is successful on the jth trial and 0 otherwise For each trial a latent variablewj is defined as follows

(A1)

Latent variables w are never known All that is known is their sign If a trial j was successful it was posi-tive otherwise it was nonpositive Although w is not known its construction is nonetheless essential

Without any loss of generality the random variable wj is assumed to be a normal with a variance of 10 ForEquation A1 to hold the mean of these normals needs be z where z Φ1( p)A1 Therefore

This construction of wj is shown in Figure A1a The parameters of interest are z and w (w1 wJ) thedata are (x1 xJ) Note that if the latent variables w were known the problem of estimating z would besimple Parameter z is simply the population mean of w and could be estimated accordingly The fact that wis latent will not prove to be a stumbling block

The goal is to derive the marginal posterior of z. To do so, conditional posterior distributions are derived and then integrated using Gibbs sampling. The desired conditionals are z | w and wj | z, xj for j = 1, …, J. We start with the former. Parameter z is the population mean of w; hence, this case is that of estimating a population mean from normally distributed observations (w). The prior on z is a normal with parameters μ0 and σ0², so the previous derivation is applicable. A direct application of Equations 23 and 24 yields that the conditional posterior z | w is a normal with parameters

$$\mu' = \left(N + \frac{1}{\sigma_0^2}\right)^{-1} \left(N\bar{w} + \frac{\mu_0}{\sigma_0^2}\right) \tag{A2}$$

and

$$\sigma'^2 = \left(N + \frac{1}{\sigma_0^2}\right)^{-1}, \tag{A3}$$

where N = J is the number of latent observations w1, …, wJ and w̄ is their sample mean.

The conditional wj | z, xj is only slightly more complex. If xj is unknown, latent variable wj is normally distributed and centered at z. When wj is conditioned on xj, the sign of wj is known. If xj = 0, then by Equation A1, wj must be negative. Hence, the conditional wj | z, xj is a normal density that is truncated above at 0 (Figure A1b). Likewise, if xj = 1, then wj must be positive, and the conditional is a normal truncated below at 0 (Figure A1c). The conditional posterior, therefore, is a truncated normal, truncated either above or below depending on whether or not the trial was answered successfully.

Once conditionals are specified, analysis proceeds with Gibbs sampling. Sampling from a normal is straightforward; sampling from a truncated normal is not too difficult, and an algorithm for such sampling is coded in Appendix C.

[Figure A1 (not reproduced here). The Albert and Chib (1995) construction of a trial-specific latent variable for probit transforms: (a) construction of the unconditional latent variable wj for p = .84; (b) conditional distribution of wj on a failure on trial j, in which case wj must be negative; (c) conditional distribution of wj on a success on trial j, in which case wj must be positive.]

NOTE TO APPENDIX A

A1. Let μw be the mean of w. Then, by construction,

$$\Pr(w_j < 0) = \Pr(x_j = 0) = 1 - p.$$

Random variable wj is a normal with a variance of 1.0. The term Pr(wj < 0) is its cumulative distribution function evaluated at 0. Therefore,

$$\Pr(w_j < 0) = \Phi(0 - \mu_w) = \Phi(-\mu_w).$$

Combining this equality with the one above yields

$$\Phi(-\mu_w) = 1 - p,$$

and, by symmetry of the normal,

$$\mu_w = \Phi^{-1}(p) = z.$$
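To make the scheme concrete, the following is a minimal R sketch (ours, not part of the original appendix) of the Appendix A Gibbs sampler for a single probability p. The truncated-normal samplers rtnpos and rtnneg are those of Appendix C; the data vector x and the prior settings mu0 and sigma0 are illustrative.

#minimal sketch of the Appendix A sampler (illustrative)
rtnpos=function(n,b){u=runif(n); b+qnorm(1-u*pnorm(b))} #normal(b,1) truncated below 0
rtnneg=function(n,b){u=runif(n); b+qnorm(u*pnorm(-b))}  #normal(b,1) truncated above 0
x=rbinom(50,1,.7)   #hypothetical data: J=50 Bernoulli trials, true p=.7
J=length(x)
s=sum(x)            #number of successes
mu0=0; sigma0=100   #prior mean and variance of z
M=5000
z=1:M; z[1]=0
for (m in 2:M)
{
w=c(rtnpos(s,z[m-1]),rtnneg(J-s,z[m-1])) #w | z, x: truncated normals
prec=J+1/sigma0                          #Equations A2 and A3 with N=J
meanz=(sum(w)+mu0/sigma0)/prec
z[m]=rnorm(1,meanz,sqrt(1/prec))         #z | w
}
mean(pnorm(z[101:M])) #posterior mean of p, after discarding burn-in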






APPENDIX B

Gibbs sampling for the hierarchical model is based on the Albert and Chib (1995) method, as follows. Let xij be the indicator for the ith person on the jth trial: xij = 1 if the trial is successful and xij = 0 otherwise. Latent variable wij is constructed as

$$x_{ij} = \begin{cases} 1, & w_{ij} \ge 0, \\ 0, & w_{ij} < 0. \end{cases} \tag{B1}$$

Conditional posterior wij | zi, xij is a truncated normal, as was discussed in Appendix A. Conditional posterior zi | w, μ, σ² is a normal with mean and variance given in Equations A2 and A3, respectively (with the substitution of μ for μ0 and σ² for σ0²). Conditional posterior densities of μ | σ², z and σ² | μ, z are given in Equations 25 and 28, respectively.
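Spelling out that substitution (our restatement, assuming J trials per participant and writing w̄i for the mean of person i's latent variables), the conditional posterior of each zi is

$$z_i \,|\, \mathbf{w}_i, \mu, \sigma^2 \sim \mathrm{Normal}\!\left(\left(J + \frac{1}{\sigma^2}\right)^{-1}\!\left(J\bar{w}_i + \frac{\mu}{\sigma^2}\right),\ \left(J + \frac{1}{\sigma^2}\right)^{-1}\right),$$

which is exactly the update used for z[m,i] in the Appendix C code.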




APPENDIX C

This appendix provides code in R for the hierarchical analysis of several individuals' probabilities. R is a freely available, easy-to-install, open-source statistical package based on S-PLUS. It runs on Windows, Macintosh, and UNIX platforms and can be downloaded from www.R-project.org.

#functions to sample from a truncated normal
#-------------------------------------------
#generates n samples of a normal (b,1) truncated below zero
rtnpos=function(n,b)
{
u=runif(n)
b+qnorm(1-u*pnorm(b))
}
#generates n samples of a normal (b,1) truncated above zero
rtnneg=function(n,b)
{
u=runif(n)
b+qnorm(u*pnorm(-b))
}
#-----------------------------
#generate true values and data
#-----------------------------
I=20 #Number of participants
J=50 #Number of Trials per participant
#generate each individual's true p
p=rbeta(I,3.5,1.5)
#result, use scan() to enter
#0.6932272 0.6956900 0.6330869 0.6504598 0.7008421 0.6589233 0.7337305
#0.6666958 0.5867875 0.7132572 0.7863823 0.6869082 0.6665865 0.7426910
#0.6649086 0.6212024 0.7375715 0.8025261 0.7587382 0.7363251
#generate each individual's observed numbers of successes
y=rbinom(I,J,p)
#result, use scan() to enter
#34 34 34 29 37 32 38 29 33 38 38 34 30 37 37 28 33 42 33 37
#----------------------------------------
#Analysis, Set Up
M=10000 #total number of MCMC iterations
myrange=101:M #portion kept; burn-in of 100 iterations discarded
#needed vectors and matrices
z=matrix(ncol=I,nrow=M) #matrix of each person's z at each iteration
#note z is probit of p, e.g., z=qnorm(p)
sigma2=1:M
mu=1:M
#prior parameters
sigma0=100 #this parameter is a variance, sigma_0^2
mu0=0
a=1
b=1
#initial values
sigma2[1]=1
mu[1]=1
z[1,]=rep(1,I)
#shape for inverse gamma, calculated outside of loop for speed
shape=I/2+a
#-----------------------------
#Main MCMC loop
#-----------------------------
for (m in 2:M) #iteration
{
for (i in 1:I) #participant
{
w=c(rtnpos(y[i],z[m-1,i]),rtnneg(J-y[i],z[m-1,i])) #samples of w|y,z
prec=J+1/sigma2[m-1]
meanz=(sum(w)+mu[m-1]/sigma2[m-1])/prec
z[m,i]=rnorm(1,meanz,sqrt(1/prec)) #sample of z|w,mu,sigma2
}
prec=I/sigma2[m-1]+1/sigma0
meanmu=(sum(z[m,])/sigma2[m-1]+mu0/sigma0)/prec
mu[m]=rnorm(1,meanmu,sqrt(1/prec)) #sample of mu|z
rate=sum((z[m,]-mu[m])^2)/2+b
sigma2[m]=1/rgamma(1,shape=shape,scale=1/rate) #sample of sigma2|z
}
#-----------------------------------------
#Gather estimates of p for each individual
allpest=pnorm(z[myrange,]) #MCMC chains of p
pesth=apply(allpest,2,mean) #posterior means
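After the loop finishes, a few extra lines (ours, not part of the original listing) suffice for basic checks of the chains and of shrinkage in the estimates:

plot(mu[myrange],type="l")     #trace plot of mu; should look stationary
plot(sigma2[myrange],type="l") #trace plot of sigma2
round(pesth,3)                 #posterior mean of each individual's p
plot(p,pesth); abline(0,1)     #estimates vs. true values; note shrinkage toward the mean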

(Manuscript received May 25, 2004; revision accepted for publication January 12, 2005.)
